Preface



Networks are ubiquitous in science and have become a focal point for discussion in everyday 
life. Formal statistical models for the analysis of network data have emerged as a major 
topic of interest in diverse areas of study and most of these involve a form of graphical rep- 
resentation. Probability models on graphs date back to 1959. Along with empirical studies 
in social psychology and sociology from the 1960s, these early works generated an active 
"network community" and a substantial literature in the 1970s. This effort moved into the 
statistical literature in the late 1970s and 1980s, and the past decade has seen a burgeoning 
network literature in statistical physics and computer science. The growth of the World
Wide Web and the emergence of online "networking communities" such as Facebook, MySpace,
and LinkedIn, along with a host of more specialized professional network communities, have
intensified interest in the study of networks and network data.

Our goal in this review is to provide the reader with an entry point to this burgeoning 
literature. We begin with an overview of the historical development of statistical network 
modeling and then we introduce a number of examples that have been studied in the network 
literature. Our subsequent discussion focuses on a number of prominent static and dynamic 
network models and their interconnections. We emphasize formal model descriptions, and 
pay special attention to the interpretation of parameters and their estimation. We end with 
a description of some open problems and challenges for machine learning and statistics. 
Chapter 1
Introduction 



Many scientific fields involve the study of networks in some form. Networks have been 
used to analyze interpersonal social relationships, communication networks, academic paper 
coauthorships and citations, protein interaction patterns, and much more. Popular books 
on networks and their analysis began to appear a decade ago [see, e.g., 24; 50; 318; 319; 68],
and online "networking communities" such as Facebook, MySpace, and LinkedIn are an even
more recent phenomenon.

In this work, we survey selective aspects of the literature on statistical modeling and 
analysis of networks in social sciences, computer science, physics, and biology. Given the 
volume of books, papers, and conference proceedings published on the subject in these 
different fields, a single comprehensive survey would be impossible. Our goal is far more 
modest. We attempt to chart the progress of statistical modeling of network data over the
past seventy years, to outline succinctly the major schools of thought and approaches to
network modeling, and to describe some of their interconnections. We also attempt to
identify major statistical gaps in these modeling efforts. From this overview one might 
then synthesize and deduce promising future research directions. Kolaczyk [177] provides a 
complementary statistical overview. 

The existing set of statistical network models may be organized along several major 
axes. For this article, we choose the axis of static vs. dynamic models. Static network 
models concentrate on explaining the observed set of links based on a single snapshot of the 
network, whereas dynamic network models are often concerned with the mechanisms that 
govern changes in the network over time. Most early examples of networks were single static 
snapshots. Hence static network models have been the main focus of research for many 
years. However, with the emergence of online networks, more data is available for dynamic 
analysis, and in recent years there has been growing interest in dynamic modeling. 

In the remainder of this chapter we provide a brief historical overview of network modeling 
approaches. In subsequent chapters we introduce some examples studied in the network 
literature and give a more detailed comparative description of select modeling approaches. 
1.1 Overview of Modeling Approaches



Almost all of the "statistically" oriented literature on the analysis of networks derives from 
a handful of seminal papers. In social psychology and sociology there is the early work of 
Simmel and Wolff [268] at the turn of the last century and Moreno [221] in the 1930s as well as 
the empirical studies of Stanley Milgram [215; 298] in the 1960s; in mathematics/probability 
there is the Erdos-Renyi paper on random graph models [94]. There are other papers that 
dealt with these topics contemporaneously or even earlier. But these are the ones that appear 
to have had lasting impact. 

Moreno [221] invented the sociogram — a diagram of points and lines used to represent 
relations among persons, a precursor to the graph representation for networks. Luce and 
others developed a mathematical structure to go with Moreno's sociograms using incidence 
matrices and graphs (see, e.g., [202; 200; 201; 203; 244; 282; 11]), but the structure they 
explored was essentially deterministic. Milgram gave the name to what is now referred to as 
the "Small World" phenomenon — short paths of connections linking most people in social 
spheres — and his experiments had provocative results: the shortest path between any two 
people for completed chains has a median length of around 6; however, the majority of chains 
initiated in his experiments were never completed! (His studies provided the title for the
play and movie Six Degrees of Separation, which ignores the complexity that censoring adds
to his results.) White [321] and Fienberg and Lee [100] gave a formal Markov-chain-like model
and analysis of the Milgram experimental data, including information on the uncompleted 
chains. Milgram's data were gathered in batches of transmission, and thus these models can 
be thought of as representing early examples of generative descriptions of dynamic network 
evolution. Recently, Dodds et al. [86] studied a global "replication" variation on the Milgram 
study in which more than 60,000 e-mail users attempted to reach one of 18 target persons 
in 13 countries by forwarding messages to acquaintances. Only 384 of 24,163 chains reached
their targets, but the authors estimate the median length for completions to be 7 by assuming
that attrition occurs at random.
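
One way to make the random-attrition assumption concrete is the following sketch. If every step of a chain is independently dropped with the same probability r, a chain of length L completes with probability (1 - r)^L, so long chains are under-represented among completions. Dividing each observed count by (1 - r)^L gives an estimate of the length distribution that would have been seen without attrition. The counts, the value of r, the constant-rate simplification, and the function names below are all hypothetical choices for illustration; they are not the data or the exact procedure of Dodds et al. [86].

```python
def correct_for_attrition(completed_counts, r):
    """Re-weight completed-chain counts by the inverse completion probability.

    completed_counts maps chain length L to the number of completed chains of
    that length; r is an assumed constant per-step attrition probability.
    """
    return {L: n / (1.0 - r) ** L for L, n in completed_counts.items()}


def weighted_median(weights):
    """Smallest length at which the cumulative weight reaches half the total."""
    total = sum(weights.values())
    running = 0.0
    for length in sorted(weights):
        running += weights[length]
        if running >= total / 2:
            return length
    raise ValueError("empty distribution")


# Hypothetical counts of completed chains by length (not data from [86]).
completed = {1: 5, 2: 20, 3: 60, 4: 100, 5: 80, 6: 50, 7: 25, 8: 10, 9: 4}
r = 0.3  # hypothetical per-step attrition probability

observed_median = weighted_median(completed)
corrected_median = weighted_median(correct_for_attrition(completed, r))
print("median of completed chains:", observed_median)
print("median after attrition correction:", corrected_median)
```

Because the correction weights grow with L, the corrected median can never be smaller than the median of the completed chains, which is the direction of the adjustment described above.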

The social science network research community that arose in the 1970s was built upon 
these earlier efforts, in particular the Erdos-Renyi-Gilbert model. Research on the Erdos- 
Renyi-Gilbert model (along with works by Katz et al. [166; 168; 167]) engendered the field of 
random graph theory. In their papers, Erdos and Renyi worked with a fixed number of vertices,
N, and number of edges, E, and studied the properties of this model as E increases. Gilbert 
studied a related two-parameter version of the model, with N as the number of vertices 
and p the fixed probability for choosing edges. Although their descriptions might at first 
appear to be static in nature, we could think in terms of adding edges sequentially and thus 
turn the model into a dynamic one. In this alternative binomial version of the
Erdos-Renyi-Gilbert model, the key to asymptotic behavior is the value λ = pN. There is a "phase
change" associated with the value of λ = 1, at which point we shift from seeing many small
connected components in the form of trees to the emergence of a single "giant connected 
component." Probabilists such as Pittel [243] imported ideas and results from stochastic 
processes into the random graph literature. 
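
The phase change can be seen in a small simulation. The sketch below samples the binomial (Gilbert) version of the model, in which each of the possible edges is included independently with probability p, and reports the fraction of vertices in the largest connected component for several values of pN (called lam in the code). The function name, the graph size, and the use of a union-find structure are illustrative choices, not part of the original papers.

```python
import random
from collections import Counter


def gnp_largest_component_fraction(n, p, seed=0):
    """Sample G(n, p) and return the share of vertices in its largest component."""
    rng = random.Random(seed)
    parent = list(range(n))  # union-find forest over the n vertices

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:  # include edge {i, j} with probability p
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[ri] = rj

    sizes = Counter(find(v) for v in range(n))
    return max(sizes.values()) / n


if __name__ == "__main__":
    n = 1000
    for lam in (0.5, 1.0, 1.5, 2.0):
        frac = gnp_largest_component_fraction(n, lam / n, seed=1)
        print(f"pN = {lam:.1f}: largest component holds {frac:.3f} of the vertices")
```

For pN well below 1 the largest component should stay a small fraction of the graph, while for pN well above 1 a single component should absorb a large share of the vertices.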

Holland and Leinhardt's [149] p1 model extended the Erdos-Renyi-Gilbert model to allow
for differential attraction (popularity) and expansiveness, as well as an additional effect due
to reciprocation. The p1 model was log-linear in form, which allowed for easy computation of
maximum likelihood estimates using a contingency table formulation of the model [101; 102]. 
It also allowed for various generalizations to multidimensional network structures [103] and 
stochastic blockmodels. This approach to modeling network data quickly evolved into the 
class of p* or exponential random graph models (ERGM) originating in the work of Frank 
and Strauss [110] and Strauss and Ikeda [287]. A trio of papers demonstrating procedures for 
using ERGMs [316; 241; 254] led to the widespread use of ERGMs in a descriptive form for
cross sectional network structures or cumulative links for networks — what we refer to here 
as static models. Full maximum likelihood approaches for ERGMs appeared in the work of 
Snijders and Handcock and their collaborators, some of which we describe in chapter 3. 
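
For concreteness, the log-linear form of the p1 model of Holland and Leinhardt can be written dyad by dyad. The display below uses a common parameterization; the symbols are notation assumed here for illustration and do not appear in the text above. Y_ij = 1 means that node i sends a tie to node j.

```latex
\begin{align*}
\log \Pr(Y_{ij}=0,\, Y_{ji}=0) &= \lambda_{ij}, \\
\log \Pr(Y_{ij}=1,\, Y_{ji}=0) &= \lambda_{ij} + \theta + \alpha_i + \beta_j, \\
\log \Pr(Y_{ij}=0,\, Y_{ji}=1) &= \lambda_{ij} + \theta + \alpha_j + \beta_i, \\
\log \Pr(Y_{ij}=1,\, Y_{ji}=1) &= \lambda_{ij} + 2\theta + \alpha_i + \alpha_j + \beta_i + \beta_j + \rho .
\end{align*}
```

In this notation θ governs overall density, α_i the expansiveness of sender i, β_j the popularity of receiver j, and ρ the reciprocation effect, while λ_ij normalizes the four probabilities of each dyad to sum to one. The α and β terms are usually constrained to sum to zero for identifiability. Setting all α_i, β_j, and ρ to zero leaves independent edges with a common probability, which recovers the Erdos-Renyi-Gilbert model.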

Most of the early examples of networks in the social science literature were relatively small 
(in terms of the number of nodes) and involved the study of the network at a fixed point 
in time or cumulatively over time. Only a few studies (e.g., Sampson's 1968 data on novice 
monks in the monastery [259]) collected, reported, and analyzed network data at multiple 
points in time so that one could truly study the evolution of the network, i.e., network 
dynamics. The focus on relatively small networks reflected the state of the art in computation,
but it was sufficient to trigger the discussion of how one might assess the fit of a network
model. Should one focus on "small sample" properties and exact distributions given some 
form of minimal sufficient statistic, as one often did in other areas of statistics, or should 
one look at asymptotic properties, where there is a sequence of networks of increasing size? 
Even if we have "repeated cross-sections" of the network, if the network is truly evolving 
in continuous time we need to ask how to ensure that the continuous time parameters are 
estimable. We return to many of these questions in subsequent chapters.

In the late 1990s, physicists began to work on network models and study their properties 
in a form similar to the macro-level descriptions of statistical physics. Barabasi, Newman, 
and Watts, among others, produced what we can think of as variations on the Erdos-Renyi- 
Gilbert model which either controlled the growth of the network or allowed for differential 
probabilities for edge addition and/or deletion. These variations were intended to produce 
phenomena such as "hubs," "local clustering," and "triadic closures." The resulting models 
gave us fixed degree distribution limits in the form of power laws (variations on preferential
attachment models, "the rich get richer," that date back to Yule [329] and Simon [269]; see
also [218]) as well as what became known as "small world" models. The small-world
phenomenon, which harks back to Milgram's 1960s studies, usually refers to two distinct 
properties: (1) small average distance and (2) the "clustering" effect, where two nodes with 
a common neighbor are more likely to be adjacent. Many of these authors claim that these 
properties are ubiquitous in realistic networks. To model networks with the small-world 
phenomenon, it is natural to utilize randomly generated graphs with a power law degree 
distribution, where the fraction of nodes with degree k is proportional to k^(-a) for some
positive exponent a. Many of the most relevant papers are included in an edited collection 
by Newman et al. [231]. More recently, this style of statistical physics model has been
used to detect community structure in networks (see, e.g., Girvan and Newman [122] and
Backstrom et al. [20]), a phenomenon that has its counterpart description in the social
science network modeling literature.
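
The "rich get richer" mechanism mentioned above can be illustrated with a minimal growth simulation. In this sketch each new node attaches by a single edge to an existing node chosen with probability proportional to that node's current degree. This is one simple member of the preferential attachment family, chosen here for brevity; the function name and the one-edge-per-node rule are assumptions of the example, not a specific model from the cited papers.

```python
import random
from collections import Counter


def preferential_attachment_degrees(n, seed=0):
    """Grow a tree on n nodes by degree-proportional attachment; return degrees."""
    rng = random.Random(seed)
    degree = [1, 1]     # start from a single edge between nodes 0 and 1
    endpoints = [0, 1]  # every node appears once per incident edge
    for new in range(2, n):
        # A uniform draw from the endpoint list is a degree-proportional draw.
        target = rng.choice(endpoints)
        degree[target] += 1
        degree.append(1)
        endpoints.extend((target, new))
    return degree


if __name__ == "__main__":
    n = 20000
    degrees = preferential_attachment_degrees(n, seed=1)
    counts = Counter(degrees)
    for k in (1, 2, 4, 8, 16):
        print(f"fraction of nodes with degree {k}: {counts[k] / n:.4f}")
    print("largest degree:", max(degrees))
```

Most nodes keep degree 1 while a few early nodes accumulate many links, so the empirical degree fractions fall off slowly with k rather than exponentially, which is the heavy-tailed behavior the power-law description refers to.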

The probabilistic literature on random graph models from the 1990s made the link with 
epidemics and other evolving stochastic phenomena. Picking up on this idea, Watts and 
Strogatz [320] and others used epidemic models to capture general characteristics of the 
evolution of these new variations on random networks. Durrett [91] has provided us with a 
book-length treatment on the topic with a number of interesting variations on the theme. 
The appeal of stochastic processes as descriptions of dynamic network models comes from 
being able to exploit the extensive literature already developed, including the existence and 
the form of stationary distributions and other model features or properties. Chung and Lu 
[69] provide a complementary treatment of these models and their probabilistic properties. 

One of the principal problems that we see with this diverse network literature is that,
with some notable exceptions, the statistical tools for estimating and assessing the fit of
"statistical physics" or stochastic process models are lacking. Consequently, no attention is
paid to the fact that real data may often be biased and noisy. What authors in the network 
literature have often relied upon is the extraction of key features of the related graphical 
network representation, e.g., the use of power laws to represent degree distributions or
measures of centrality and clustering, without any indication that they are either necessary or
sufficient as descriptors for the actual network data. Moreover, these summary quantities
can often be highly misleading, as the critique by Stouffer et al. [285, 286] of methods used
by Barabasi [25] and Vazquez et al. [304] suggests. Barabasi claimed that the dynamics of a
number of human activities are scale-free, i.e., he specifically reported that the probability 
distribution of time intervals between consecutive e-mails sent by a single user and time 
delays for e-mail replies follow a power law with exponent -1, and he proposed a
priority-queuing process as an explanation of the bursty nature of human activity. Stouffer et al.
[286] demonstrated that the reported power-law distribution was solely an artifact of the 
analysis of the empirical data and used Bayes factors to show that the proposed model is 
not representative of e-mail communication patterns. See a related discussion of the poor fit 
of power laws in Clauset et al. [74]. There are several works, however, that try to address 
model fitting and model comparison. For example, the work of Williams and Martinez [323] 
showed how a simple two-parameter model predicted "key structural properties of the most 
complex and comprehensive food webs in the primary literature." Another good example is
the work of Middendorf et al. [214] where the authors used network motif counts as input to 
a discriminative systematic classification for deciding which configuration model the actual 
observed network came from; they looked at power-law, small-world, duplication-mutation,
duplication-mutation-complementation, and other models (seven in total) and concluded
that the duplication-mutation-complementation model best described the protein-protein
interaction data for Drosophila melanogaster.

Machine learning approaches emerged in several forms over the past decade with the 
empirical studies of Faloutsos et al. [97] and Kleinberg [173, 172, 174], who introduced 
a model for which the underlying graph is a grid — the graphs generated do not have a 
power law degree distribution, and each vertex has the same expected degree. The strict 
requirement that the underlying graph be a cycle or grid renders the model inapplicable
to webgraphs or biological networks. Durrett [91] treats variations on this model as well. 
More recently, a number of authors have looked to combine the stochastic blockmodel ideas 
from the 1980s with latent space models, model-based clustering [137] or mixed-membership 
models [9], to provide generative models that scale in reasonable ways to substantial-sized 
networks. The class of mixed membership models resembles a form of soft clustering [95] 
and includes the latent Dirichlet allocation model [41] from machine learning as a special 
case. This class of models offers much promise for the kinds of network dynamical processes 
we discuss here. 

1.2 What This Survey Does Not Cover 

This survey focuses primarily on statistical network models and their applications. As a 
consequence there are a number of topics that we touch upon only briefly or essentially not 
at all, such as 

• Probability theory associated with random graph models. The probabilistic literature 
on random graph models is now truly extensive and the bulk of the theorems and 
proofs, while interesting in their own right, are largely unconnected with the present 
exposition. For excellent introductions to this literature, see Chung and Lu [69] and 
Durrett [91]. For related results on the mathematics of graph theory, see Bollobas [43]. 

• Efficient computation on networks. There is a substantial computer science litera- 
ture dealing with efficient calculation of quantities associated with network structures, 
such as shortest paths, network diameter, and other measures of connectivity, central- 
ity, clustering, etc. The edited volume by Brandes and Erlebach [48] contains good 
overviews of a number of these topics as well as other computational issues associated 
with the study of graphs. 

• Use of the network as a tool for sampling. Adaptive sampling strategies modify the 
sampling probabilities of selection based on observed values in a network structure. 
This strategy is beneficial when searching for rare or clustered populations. Thompson 
and Seber [296] and Thompson [293] discuss adaptive sampling in detail. There is also 
related work on target sampling [294] and respondent-driven sampling [258; 305]. 

• Neural networks. Neural networks originated as simple models for connections in the 
brain but have more recently been used as a computational tool for pattern recognition 
(e.g., Bishop [38]), machine learning (e.g., Neal [228]), and models of cognition (e.g., 
Rogers and McClelland [257]). 

• Networks and economic theory. A relatively new area of study is the link between 
network problems, economic theory, and game theory. Some useful entrees to this 
literature are Even-Dar and Kearns [96], Goyal [131], Kearns et al. [169], and Jackson 
[160], whose book contains an excellent semi-technical introduction to network concepts
and structures. 

• Relational networks. This is a very popular area in machine learning. It uses proba- 
bilistic graphical models to represent uncertainty in the data. The types of "networks" 
in this area, such as Bayes nets, dependency diagrams, etc., have a different meaning 
than the networks we consider in this review. The main difference is that the net- 
works in our work are considered to "be given" or arising directly from properties of 
the network under study, rather than being representative of the uncertainty of the 
relationships between nodes and node attributes. There is an extensive literature
on relational networks; see, e.g., Friedman et al. [112], Getoor et al. [117], Neville and
Jensen [229], Neville et al. [230], and Getoor and Taskar [116].

• Bipartite graphs. These are graphs that represent measurements on two populations
of objects, such as individuals and features. The graphs in this context are seldom
the best representation of the data, with the possible exception of binary measurements
or cases where the true populations have comparable sizes. Recent work on exchangeable
Rasch matrices is related to this topic and potentially relevant for network analysis.
Lauritzen [186, 187] and Bassetti et al. [29] suggest applications to bipartite graphs.

• Agent-based modeling. Building on older ideas such as cellular automata, agent-based 
modeling attempts to simulate the simultaneous operations of multiple agents, in an 
effort to re-create and predict the actions of complex phenomena. Because the interest
is often in the interaction among the agents, this domain of research has been
linked with network ideas. With the recent advances in high-performance computing, 
simulations of large-scale social systems have become an active area of research, e.g., 
see [46]. In particular, there is a strong interest in areas that revolve around national 
security and the military, with studies on the effects of catastrophic events and bio- 
logical warfare, as well as computational explorations of possible recovery strategies 
[57; 59]. These works are the contemporary counterparts of more classical work at the 
interface between artificial intelligence and the social sciences [54; 56; 55]. 



Chapter 2 

Motivation and Dataset Examples 



2.1 Motivations for Network Analysis 

Why do we analyze networks? The motivation behind network analysis is as diverse as the 
origin of network problems within differing academic fields. Before we delve into details of 
the "how" of statistical network modeling, we start with some examples of the "why." This 
chapter also includes descriptions of popular datasets for interested readers who may wish 
to exercise their modeling muscles. 

Social scientists are often interested in questions of interpretation such as the meanings of 
edges in a social network [181]. Do they arise out of friendliness, strategic alliance, obligation, 
or something else? When the meaning of edges is known, the object is often to characterize
the structure of those relations (e.g., whether friendships or strategic alliances are hierarchical 
or transitive). A large volume of statistically-oriented social science literature is dedicated to 
modeling the mechanisms and relations of network properties and testing hypotheses about 
network structure; see, e.g., [280].

Physicists, on the other hand, tend to be interested in understanding parsimonious mech- 
anisms for network formation [28; 235]. For example, a common modeling goal is to explain 
how a given network comes to have its particular degree distribution or diameter at time t. 

Several network analysis concepts have found niches in computational biology. For ex- 
ample, work on protein function classification can be thought of as finding hidden groups in 
the protein-protein interaction network [7; 8] to gain better understanding of underlying bi- 
ological processes. Label propagation (node similarity) in networks can be harnessed to help 
with functional gene annotation [226]. Graph alignment can be used to locate subgraphs 
that are common among species, thus advancing our understanding of evolution [105]. Mo- 
tif finding, or more generally the search for subgraph patterns, also has many applications 
[17]. Combining networks from heterogeneous data sources helps to improve the accuracy 
of predicted genetic interactions [327]. Heterogeneity of network data sources in biology 
introduces a lot of noise into the global network structure, especially when networks created 
for different purposes (such as protein co-regulation and gene co-expression) are combined. 
The work in [225] addresses network de-noising via degree-based structure priors on graphs.
For a review of biological applications of networks, see [332].

The task of finding hidden groups is also relevant in analyzing communication networks, 
e.g., in detecting possible latent terrorist cells [30]. The related task of discovering the "roles" 
of individual nodes is useful for identity disambiguation [36] and for business organization 
analysis [207]. These applications often take the machine learning approach of graph parti- 
tioning, a topic previously known in social science and statistics literature as blockmodeling 
[199; 89]. A related question is functional clustering, where the goal is not to statistically 
cluster the network, but to discover members of dynamic communities with similar functions 
based on existing network connectivity [122; 232; 234; 266]. 

In the machine learning community, networks are often used to predict missing informa- 
tion, which can be edge related, e.g., predicting missing links in the network [238; 73; 198], 
or attribute related, e.g., predicting how likely a movie is to be a box office hit [229]. Other 
applications include locating the crucial missing link in a business or a terrorist network, or 
calculating the probability that a customer will purchase a new product, given the pattern 
of purchases of the customer's friends [142]. The latter question can more generally be stated
as predicting an individual's preferences given the preferences of that individual's "friends."
This research direction has evolved into an area of its own under the name of recommender
systems, which has recently received a lot of media attention due to the competition run by the
largest online movie rental company, Netflix. The company awarded a prize of one million dollars
to a team of researchers who were able to predict customer ratings of movies with accuracy more
than 10% higher than that of its own in-house system [290].

The concept of information propagation also finds many applications in the network 
domain, such as virus propagation in computer networks [310], HIV infection networks [222; 
163; 164], viral marketing [87] and more generally gossiping [170]. Here some work focuses 
on finding network configurations optimal for routing, while other research assumes that the 
network structure is given and focuses on suitable models for disease or information spread.

2.2 Sample Datasets 

A plethora of data sets are available for network analysis, and more are emerging every year. 
We provide a quick guided tour of the most popular datasets and applications in each field. 

In his ground-breaking paper, Milgram [215] experimented with the construction of in- 
terpersonal social networks. His result that the median length of completed chains was 
approximately 6 led to the pop-culture coining of the phrase "six degrees of separation." 
Subjects of subsequent studies ranged from social interactions of monks [259], to hierar- 
chies of elephants [209; 303], to sexual relationships between adults of Colorado [176], to 
friendships amongst elementary school students [141; 299]. 

While a lot of biological applications focus on the study of protein-protein interaction 
networks [114; 115; 184; 248; 328], metabolic networks [158], functional and co-expression 
gene similarity networks and gene regulatory networks [111; 309], computer science applica- 
tions revolve around e-mail [207], the internet [97; 63; 151], the web [152; 13], academic paper 
co-authorship [127] and citation networks [204; 216]. Citation networks have a long history 
of modeling in different areas of research, starting with the seminal paper of de Solla Price
[83] and more recently in physics [190]. With the recent rise of online networks, computer
science and social science researchers are also starting to examine blogger networks such as
LiveJournal, social networks found on Friendster, Facebook, and Orkut, and dating networks such
as Match.com.

Terrorist networks (often simulated) and telecommunication networks have come under 
similar scrutiny, especially since the events of September 11, 2001 (e.g., see [182; 250; 249; 
62]). There has also been work on ecological networks such as foodwebs [323; 16], neuronal 
networks [188], network epidemiology [306], economic trading networks [123], transporta- 
tion networks (roads, railways, airplanes; e.g., [113]), resource distribution networks, mobile 
phone networks [92] and many others. 

Several network data repositories are available on public websites and as part of packages. 
For example, UCINet 1 includes many well-known smaller-scale datasets such as the Davis
Southern Club Women dataset [80], Zachary's karate club dataset [330], and Sampson's
monk data [259] described below. Pajek 2 contains a larger set of small and large networks
from domains such as biology, linguistics, and food webs. Additional datasets in a variety
of domains include power grid networks, US politics, cellular and protein networks, and
others. 3 A collection of large and very large directed and undirected networks in the areas
of communication, citation, the internet, and others is available as part of the Stanford Network
Analysis Package (SNAP). 4

We now introduce six examples of networks studied in the literature, describing the data 
in reasonable detail and including graphs depicting the networks wherever feasible. For each 
network example we articulate specific questions of interest. 

2.2.1 Sampson's "Monastery" Study 

A classic example of a social network is the one derived from the survey administered by 
Sampson and published in his doctoral dissertation [259]. Figure 2.1 displays the network
derived from the "whom do you like" sociometric relations in this dataset. Sampson spent 
several months in a monastery in New England, where a number of novices were preparing to 
join a monastic order. Sampson's original analysis was rooted in direct anthropological ob- 
servations. He strongly suggested the existence of tight factions among the novices: the loyal 
opposition (whose members joined the monastery first), the young turks (whose members 
joined later on), the outcasts (who were not accepted in either of the two main factions), and 
the waverers (who did not take sides). The events that took place during Sampson's stay at 
the monastery supported his observations. For instance, John and Gregory, two members 
of the young turks, were expelled over religious differences, and other members resigned
shortly after these events. About a year after leaving the monastery, Sampson surveyed
all of the novices, and asked them to rank the other novices in terms of four sociometric
relations: like/dislike, esteem, personal influence, and alignment with the monastic credo,
retrospectively, at four different epochs spanning his stay at the monastery.

Figure 2.1: Network derived from "whom do you like" sociometric relations collected by
Sampson.

1 http://www.analytictech.com/ucinet/
2 http://vlado.fmf.uni-lj.si/pub/networks/data/
3 http://www-personal.umich.edu/~mejn/netdata/
http://cdg.columbia.edu/cdg/datasets
http://www.nd.edu/~networks/resources.htm
4 http://snap.stanford.edu/data/

The presence of a well defined social structure within the monastery (the factions) that 
can be inferred from responses to the survey, as well as the social dynamics of subtle ideo- 
logical conflicts that led to the dissolution of the monastic order, have much intrigued both 
statisticians and social scientists for the past four decades. Researchers typically consider 
the faction labels assigned by Sampson to the novices as the anthropological ground truth 
in their analysis. For example analyses, we refer to [103; 137; 81; 9]. 

2.2.2 The Enron Email Corpus 

The Enron email corpus has been widely studied in recent machine learning network litera- 
ture. Enron Corporation was an energy and trading company specializing in the marketing of 
electricity and gas. In 2000 it was the seventh largest company in the United States with re- 
ported revenues of over $100 billion. On December 2, 2001, Enron filed for bankruptcy. The 
sudden collapse cast suspicions over its management and prompted federal investigations. 
Thirty-four Enron officials were prosecuted and top Enron executives and associates were 
subsequently found to be guilty of accounting fraud. During the investigation, the courts 
subpoenaed extensive email logs from most of Enron's employees, and the Federal Energy 
Regulatory Commission (FERC) published the database online. 5 Subsequently, researchers
in the CALO (Cognitive Assistant that Learns and Organizes) project corrected integrity
problems in the dataset. 6 The original FERC dataset contains 619,446 email messages (about
92% of Enron's staff emails), and the cleaned-up CALO dataset contains 200,399 messages
from 158 users. Another version of the data consists of the contents of the mail folders of
the top 151 executives, containing about 225,000 messages covering a period from 1997 to
2004. 7 Figure 2.2 and Figure 2.3 give network snapshots of the e-mail traffic among these
151 executives with thresholds of 5 and 30 messages, respectively.

5 http://www.ferc.gov/industries/electric/indus-act/wec/enron/info-release.asp

Figure 2.2: E-mail exchange data among 151 Enron executives, using a threshold of a minimum
of 5 messages for each link. Source: [153].

Research activity on the Enron dataset ranges from document classification to
social-network analysis to visualization. A collection of papers working with the Enron
corpus was gathered together in a special 2005 issue of Computational & Mathematical
Organization Theory; see [58].

6 http://www.cs.cmu.edu/~enron/

7 http://www.isi.edu/~adibi/Enron/Enron.htm

Figure 2.3: E-mail exchange data among 151 Enron executives, using a threshold of a
minimum of 30 messages for each link. Source: [153].

2.2.3 The Protein Interaction Network in Budding Yeast 

The budding yeast is a unicellular organism that has become a de facto model organism
for the study of molecular and cellular biology [47]. There are about 6,000 proteins in the
budding yeast, which interact in a number of ways [64]. For instance, proteins bind together
to form protein complexes, the physical units that carry out most functions in the cell
[184]. In recent years, a large amount of resources has been directed to collecting experimental
evidence of physical protein binding, in an effort to infer and catalogue protein complexes
and their multifaceted functional roles [e.g. 98; 159; 300; 114; 143]. Currently, there are
four main sources of interactions between pairs of proteins that target proteins localized 
in different cellular compartments with variable degrees of success: (i) literature curated 
interactions [248], (ii) yeast two-hybrid (Y2H) interaction assays [328], (iii) protein fragment 
complementation (PCA) interaction assays [291], and (iv) tandem affinity purification (TAP) 
interaction assays [115; 184]. These collections include a total of about 12,292 protein 
interactions [162], although the number of such interactions is estimated to be between 
18,000 [328] and 30,000 [307]. Figure 2.4 shows a popular image of the interaction network 
among proteins in the budding yeast, produced as part of an analysis by Barabasi and Oltvai 
[27]. 

Statistical methods have been developed for analyzing many aspects of this large protein 
interaction network, including de-noising [32; 8], function prediction [227], and identification 
of binding motifs [23].

2.2.4 The Add Health Adolescent Relationship and HIV Transmission Study

The National Longitudinal Study of Adolescent Health (Add Health) is a study of
adolescents in the United States drawn from a representative sample of middle, junior high,
and high schools. The study focused on patterns of friendship and sexual relationships, as
well as disease transmission. To date, four waves of surveys have been collected over the
course of fifteen years.

Wave I surveys occurred between 1994 and 1995 and included 90,118 students from 145
schools across the country. Each student completed an in-school questionnaire on his or her 
family background, school life and activities, friendships, and health status. Administrators 
from participating schools also completed questionnaires about student demography and 
school curriculum and services. In addition, 20,745 students were chosen for an in-home 
interview that included more sensitive topics such as sexual behavior. For 16 selected schools 
(two large and fourteen small), Add Health attempted to administer the in-home survey to 
all enrolled students. This saturated sample distinguishes itself from the ego-centric and
snowball samples collected in past studies; it allows for the construction of relationship
networks with more accurate global characteristics. The fully observed friendship networks
in all the schools are also a valuable resource and an important contribution of this work.

Figure 2.4: A popular image of the protein interaction network in Saccharomyces cerevisiae,
also known as the budding yeast. The figure is reproduced with permission. Source: [27].

Wave II data collection occurred 18 months after Wave I, in 1996, and followed up on the
in-home interviews. The dataset covered 14,738 adolescents and 128 school administrators.
Based on the data collected from Waves I and II, Bearman et al. [31] constructed the timed
sequence of relationship networks among students from the two large schools with saturated
sampling. The resulting sexual relationship network bears a strong resemblance to a spanning
tree, as opposed to previously hypothesized core or inverse-core structures 8 (see Figure 2.5).

Wave III interviews were conducted in 2001 and 2002 with topics including marriage,
childbearing, and sexually transmitted diseases. Of the original Wave I in-home respondents,
15,170 were interviewed again for Wave III. Of these, 13,184 participants provided oral fluid
specimens for HIV testing. Morris et al. [223] studied the prevalence of HIV infections among
young adults based on data collected in Wave III.

8 A core is a group of inter-connected individuals who sit at the center of the graph and interact with
individuals on the periphery. An inverse core is a group of central individuals who are connected to those
on the periphery but not to each other.

Figure 2.5: The Add Health sexual relationships network of US high school adolescents. This
figure is reproduced with permission. Source: Bearman et al. [31]

Wave IV interviews were conducted in 2007 and 2008 with the original Wave I respondents,
who are now dispersed across the nation in all 50 states. Of the original respondents,
92.5% were located and 80.3% were interviewed. The interview included a comprehensive
survey of the social, emotional, spiritual, and physical aspects of health. Physical
measurements, biospecimens, and geographical data were also collected.

For detailed information about the data, as well as access to the public-domain and
restricted-access datasets, see http://www.cpc.unc.edu/projects/addhealth.

2.2.5 The Framingham "Obesity" Study 

One of the most famous and important epidemiological studies was initiated in Framingham,
Massachusetts, a suburb of Boston, in 1948 with an originally enrolled cohort of 5,209 people.
In 1971 investigators initiated an "offspring" cohort study which enrolled most of the
children of the original cohort and their spouses. Participants completed a questionnaire and
underwent physical examinations (including measurements of height and weight) in
three-year periods beginning 1973, 1981, 1985, 1989, 1992, 1997, 1999. Christakis and Fowler [65]
derived body mass index information on a total of 12,067 individuals who appeared in any of
the Framingham Heart cohorts (one "close friend" for each cohort member). 9 There were
38,611 observed family and social ties (edges) to the core 5,124 cohort members.

Through a series of network snapshots and statistical analyses, Christakis and Fowler
described the evolution of the "clustering" of obesity in this social network. In particular,
they claim to have examined whether the data conformed to "small-world," "scale-free,"
and "hierarchical" types of random graph network models. Figure 2.6 depicts data on the
largest connected subcomponent (the so-called giant component) for the network in 2000,
which consists of 2200 individuals. Other analyses in their paper explore attributes of the
individuals via longitudinal logistic-regression models with lagged effects. Subsequently, they
have published similar papers focused on the dynamics of smoking behavior over time [66]
and on happiness [67], both using the structure of the Framingham "offspring" cohort.

This work has come under criticism by others. For example, Cohen-Cole and Fletcher
note that there are plausible alternative explanations for the network structure based on
contextual factors [77], and in a separate paper demonstrate that the same methodology detects
"implausible" social network effects for such medical conditions as acne and headaches as
well as for physical height [78]. The authors' answer to these criticisms can be found in [108].
The question of the magnitude and significance of social network effects is still the subject
of an ongoing debate.

2.2.6 The NIPS Paper Co-Authorship Dataset

The NIPS dataset contains information on publications that appeared in the Neural
Information Processing Systems (NIPS) conference proceedings, volumes 1 through 12,
corresponding to years 1987-1999, the pre-electronic submission era. The original collection
contained scanned full papers made available by Yann LeCun. Sam Roweis subsequently
processed the data to glean information such as title, authorship information, and word
counts per document. In total, there are 2,037 authors and 1,740 papers with an average of
2.29 authors per paper and 1.96 papers per author. The NIPS database is available from
Sam Roweis' website 10 in raw and MATLAB formats along with a detailed description and
information on its construction.

Various authors have used the NIPS data to analyze author-to-author connectivity in 
static [126] as well as dynamic settings [264]. Li and McCallum [197] modeled the text of the 
documents and Sarkar et al. [265] analyzed the two-mode network (author-word-author) in 
a dynamic context. In Figure 2.7 we reproduce a graphic illustration of the inferred dynamic 
evolution of the network from [263]. 



9 A body-mass index value (weight in kg. divided by the square of the height in meters) of 30 or more 
was taken to indicate obesity. 

10 http://www.cs.toronto.edu/~roweis/data.html
Figure 2.6: Obesity network from Framingham offspring cohort data. Each node represents
one person in the dataset (a total of 2200 in this picture). Circles with red borders denote
women, and circles with blue borders denote men. The size of each circle is proportional to
the body-mass index. The color inside the circle denotes obesity status: yellow is obese
(body-mass index > 30), green is non-obese. The colors of ties between nodes indicate
relationships: purple denotes a friendship or marital tie and orange is a familial tie. This
figure is reproduced with permission. Source: [65].

Figure 2.7: NIPS paper co-authorship data. Each point represents an author. Two authors
are linked by an edge if they have co-authored at least one paper at NIPS. Left: 1991-1994.
Right: 1995-1998. Each graph contains all the links for the selected period. Several
well-known people in the Machine Learning field are highlighted. The size of the circle around
each selected individual depends on their number of collaborations. Colors are meant to
facilitate visualization. This figure is reproduced with permission. Source: [263].
Chapter 3

Static Network Models 



A number of basic network models are essentially static in nature. The statistical activities
associated with them focus on certain local and global network statistics and the extent to
which they capture the main elements of actual realized networks. In this chapter, we briefly
summarize two lines of research. The first originates in the mathematics community with
the Erdos-Renyi-Gilbert model and led to two types of generalizations: (i) the "statistical
physics" generalizations that led to power laws for degree distributions, the so-called
scale-free graphs, and (ii) the exchangeable graph models that introduce weak dependences among
the edges in a controlled fashion, which ultimately lead to a range of more structured
connectivity patterns and enable model comparison strategies rooted in information theory. A
second line of research originated in the statistics and social sciences communities in response
to a need for models of social networks. The $p_1$ model of Holland and Leinhardt, which in
some sense generalizes the Erdos-Renyi-Gilbert model, and the more general descriptive
family of exponential random graph models effectively initiate this line of modeling. Some of
these models also have a generative interpretation that allows us to think about their use in
a dynamic, evolutionary setting. We define and discuss popular dynamic interpretations of
the data generating process, including the generative interpretation, in Chapter 4.



3.1 Basic Notation and Terminology 

In theoretical computer science, a graph or network $G$ is often defined in terms of nodes and
edges, $G = G(\mathcal{N}, \mathcal{E})$, where $\mathcal{N}$ is a set of nodes and $\mathcal{E}$ a set of edges, with
$N = |\mathcal{N}|$ and $E = |\mathcal{E}|$. In the statistical literature, $G$ is often defined in terms of the
nodes and the corresponding measurements on pairs of nodes, $G = G(\mathcal{N}, \mathcal{Y})$. $\mathcal{Y}$ is usually
represented as a square matrix of size $N \times N$. For instance, $\mathcal{Y}$ may be represented as an
adjacency matrix $Y$ with binary elements in a setting where we are only concerned with
encoding presence or absence of edges between pairs of nodes. For undirected relations the
adjacency matrix is symmetric.

Henceforth we will work with graphs mostly defined in terms of their set of $N$ nodes and their
binary adjacency matrix $Y$ containing $\sum_{ij} Y_{ij} = E$ directed edges. Nodes in the network
may represent individuals, organizations, or some other kind of unit of study. Edges correspond
to types of links, relationships, or interactions between the units, and they may be directed,
as in the Holland-Leinhardt model, or undirected, as in the Erdos-Renyi-Gilbert model.
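
As a concrete illustration of this notation, the following minimal sketch builds a binary adjacency matrix from a list of node pairs and counts its edges. The function names `adjacency_matrix` and `count_edges`, the use of integer node labels 0, ..., N-1, and the exclusion of self loops are assumptions made for the example, not conventions taken from the text.

```python
def adjacency_matrix(num_nodes, edges, directed=True):
    """Build a binary adjacency matrix Y (list of lists) from (i, j) pairs."""
    Y = [[0] * num_nodes for _ in range(num_nodes)]
    for i, j in edges:
        if i == j:
            raise ValueError("self loops are excluded in this sketch")
        Y[i][j] = 1
        if not directed:
            Y[j][i] = 1  # undirected relations give a symmetric matrix
    return Y


def count_edges(Y, directed=True):
    """E is the sum of all entries; an undirected edge is stored twice."""
    total = sum(sum(row) for row in Y)
    return total if directed else total // 2


# Directed example: 0 -> 1, 1 -> 0 and 2 -> 3 on four nodes.
Y = adjacency_matrix(4, [(0, 1), (1, 0), (2, 3)])
E = count_edges(Y)
```

For an undirected relation, passing `directed=False` stores each edge in both triangles of the matrix, which is why `count_edges` halves the total in that case.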

A note about terminology: in computer science, graphs contain nodes and edges; in 
social sciences, the corresponding terminology is usually actors and ties. We largely follow 
the computer science terminology in this review. 

3.2 The Erdos-Renyi-Gilbert Random Graph Model 

The mathematical biology literature of the 1950s contains a number of papers using what we
now know as the network model $G(N,p)$, which for a network of $N$ nodes sets the probability
of an edge between each pair of nodes equal to $p$, independently of the other edges; see, e.g.,
Solomonoff and Rapoport [281], who discuss this model as a description of a neural network.
But the formal properties of simple random graph network models are usually traced back to
Gilbert [119], who examined $G(N,p)$, and to Erdos and Renyi [93]. The Erdos-Renyi-Gilbert
random graph model, $G(N,E)$, describes an undirected graph involving $N$ nodes and a fixed
number of edges, $E$, chosen randomly from the $\binom{N}{2}$ possible edges in the graph; an equivalent
interpretation is that all $\binom{\binom{N}{2}}{E}$ graphs are equally likely. 1 The $G(N,p)$ model has a binomial
likelihood where the probability of $E$ edges is

$$\ell(G(N,p) \text{ has } E \text{ edges} \mid p) = p^{E} (1-p)^{\binom{N}{2} - E},$$

or, equivalently, in terms of the $N \times N$ binary adjacency matrix $Y$,

$$\ell(Y \mid p) = \prod_{i<j} p^{Y_{ij}} (1-p)^{1 - Y_{ij}}.$$

The likelihood of the $G(N,E)$ model is a hypergeometric distribution and this induces a
uniform distribution over the sample space of possible graphs. The $G(N,p)$ model specifies the
probability of every edge, $p$, and controls the expected number of edges, $p \cdot \binom{N}{2}$. The $G(N,E)$
model specifies the number of edges, $E$, and implies the expected "marginal" probability of
every edge, $E / \binom{N}{2}$. The $G(N,p)$ model is more commonly found in modern literature on
random graph theory, in part because the independence of edges simplifies analysis [see, e.g.,
69; 91].
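
The two sampling schemes and the binomial log-likelihood can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, graphs are stored as sets of unordered pairs (i, j) with i < j, and the seed and sizes are arbitrary. The estimate `p_hat` uses the standard fact that the likelihood $p^{E}(1-p)^{\binom{N}{2}-E}$ is maximized at $E / \binom{N}{2}$.

```python
import itertools
import math
import random


def sample_gnp(n, p, rng):
    """G(N, p): keep each of the C(N, 2) possible edges independently with probability p."""
    return {pair for pair in itertools.combinations(range(n), 2) if rng.random() < p}


def sample_gne(n, e, rng):
    """G(N, E): choose exactly E of the C(N, 2) possible edges uniformly at random."""
    return set(rng.sample(list(itertools.combinations(range(n), 2)), e))


def gnp_log_likelihood(n, num_edges, p):
    """log of p^E (1 - p)^(C(N, 2) - E), valid for 0 < p < 1."""
    m = math.comb(n, 2)
    return num_edges * math.log(p) + (m - num_edges) * math.log(1.0 - p)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
n = 50
g1 = sample_gnp(n, 0.1, rng)
g2 = sample_gne(n, 120, rng)
p_hat = len(g1) / math.comb(n, 2)  # maximizer of the G(N, p) likelihood
```

Note that `sample_gne` fixes the edge count exactly, whereas the edge count of `sample_gnp` varies from draw to draw around $p \cdot \binom{N}{2}$.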

Erdos and Renyi [94] went on to describe in detail the behavior of $G(N,E)$ as $p = E / \binom{N}{2}$
increases from 0 to 1. In the binomial version the key to asymptotic behavior is the value of
$\lambda = pN$. One of the important Erdos-Renyi results is that there is a phase change at $\lambda = 1$,
where a giant connected component emerges while the other components remain relatively
small and mostly in the form of trees [see 69; 91]. More formally,

P1. If $\lambda < 1$, then a graph in $G(N,p)$ will have no connected components of size larger
than $O(\log N)$, a.s. as $N \to \infty$.

P2. If $\lambda = 1$, then a graph in $G(N,p)$ will have a largest component whose size is of
$O(N^{2/3})$, a.s. as $N \to \infty$.

P3. If $\lambda$ tends to a constant $c > 1$, then a graph in $G(N,p)$ will have a unique "giant"
component containing a positive fraction of the nodes, a.s. as $N \to \infty$. No other
component will contain more than $O(\log N)$ nodes, a.s. as $N \to \infty$.

1 Both versions are often referred to as Erdos-Renyi models in the current literature.
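
The phase change described in P1-P3 can be observed in a small simulation. The sketch below is only an illustration at a single finite size (N = 1000, a fixed seed, and three values of $\lambda$ are arbitrary choices for the example); it does not verify the asymptotic statements. It samples $G(N,p)$ with $p = \lambda / N$ and measures the largest connected component with a union-find structure.

```python
import itertools
import random


def largest_component_size(n, edges):
    """Size of the largest connected component, computed with union-find."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj

    sizes = {}
    for v in range(n):
        root = find(v)
        sizes[root] = sizes.get(root, 0) + 1
    return max(sizes.values())


def sample_gnp(n, p, rng):
    """G(N, p) as a list of unordered pairs (i, j) with i < j."""
    return [pair for pair in itertools.combinations(range(n), 2) if rng.random() < p]


rng = random.Random(1)  # fixed seed, chosen arbitrarily
n = 1000
sizes = {}
for lam in (0.5, 1.0, 2.0):
    edges = sample_gnp(n, lam / n, rng)
    sizes[lam] = largest_component_size(n, edges)
```

Comparing `sizes[0.5]` with `sizes[2.0]` shows the contrast between the subcritical and supercritical regimes; the value at $\lambda = 1$ is much more variable from run to run.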

A summary of a proof using branching processes is given in the appendix of this chapter. 
Some of the proof concepts will be useful for discussion of exchangeable graph models in 
section 3.3. 

The Erdos-Renyi-Gilbert model has spawned an enormous number of mathematical
papers that study and generalize it, e.g., see [43]. But few of them are especially relevant for
the actual statistical analysis of network data. In essence, the model dictates that every
node in a graph has approximately the same number of neighbors. Empirically there are
few observed networks with such simple structure, but we still need formal tools for
deciding how poor a fit the model provides for a given observed network, and what kinds of
generalized network models appear to be more appropriate. This has led to two separate
literatures, one of which has focused on formal statistical properties associated with
estimating parameters of network models (the $p_1$ and exponential random graph models
described below), and a second that identifies selected predicted features of models and
empirically checks observed networks for those features. The latter is largely associated with
papers emanating from statistical physics and computer science, several of which are
described in detail in Chapter 4.



3.3 The Exchangeable Graph Model

The exchangeable graph model provides the simplest possible extension of the original
random graph model by introducing a weak form of dependence among the probability of
sampling edges (i.e., exchangeability) that is due to non-observable node attributes, in the form
of node-specific binary strings. This extension helps focus the analysis, whether empirical or
theoretical, on the interplay between connectivity of a graph and its node-specific sources of
variability [1; 5].

Consider the following data generating process for an exchangeable graph model, which
generates binary observations on pairs of nodes.

1. Sample node-specific $K$-bit binary strings for each node $n \in \mathcal{N}$,

$$b_n \sim \mathrm{unif}(\text{vertex set of the } K\text{-hypercube}),$$

2. Sample directed edges for all node pairs $(n,m) \in \mathcal{N} \times \mathcal{N}$,

$$Y_{nm} \sim \mathrm{Bern}(q(b_n, b_m)),$$

where $b_{1:N}$ are $K$-bit binary strings 2 , and $q$ maps pairs of binary strings into the $[0,1]$ interval.
This generation process induces weakly dependent edges. The edges are conditionally
independent given the binary string representations of the incident nodes. They are exchangeable
in the sense of De Finetti [82].

2 Note that the space of $K$-bit binary strings can be mapped one-to-one to the vertex set of the
$K$-hypercube, i.e., the unit hypercube in $K$ dimensions.
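
The two-step generating process, with uniformly sampled bit strings followed by Bernoulli edges whose probability is $q(b_n, b_m)$, can be sketched as follows. The text leaves $q$ unspecified, so the link function `q_shared_bits` below is purely a hypothetical choice for illustration, as are the parameter `rho`, the exclusion of self loops, and the sizes used.

```python
import random


def sample_bit_strings(n, k, rng):
    """Step 1: one K-bit string per node, uniform over the 2^K hypercube vertices."""
    return [tuple(rng.randint(0, 1) for _ in range(k)) for _ in range(n)]


def q_shared_bits(b_n, b_m, rho=0.3):
    """Hypothetical link function (an assumption, not from the text):
    every position where both strings equal 1 independently produces an edge
    with probability rho, so q = 1 - (1 - rho)^(number of shared 1-bits)."""
    shared = sum(x & y for x, y in zip(b_n, b_m))
    return 1.0 - (1.0 - rho) ** shared


def sample_edges(bits, q, rng):
    """Step 2: Y[n][m] ~ Bern(q(b_n, b_m)) for every ordered pair with n != m."""
    n = len(bits)
    Y = [[0] * n for _ in range(n)]
    for a in range(n):
        for b in range(n):
            if a != b and rng.random() < q(bits[a], bits[b]):
                Y[a][b] = 1
    return Y


rng = random.Random(0)  # fixed seed, chosen arbitrarily
bits = sample_bit_strings(30, 4, rng)
Y = sample_edges(bits, q_shared_bits, rng)
```

Given the bit strings, each entry of `Y` is drawn independently, which mirrors the conditional independence of edges described above; dependence between edges arises only through shared bit strings.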

From a statistical perspective, the exchangeable graph model we survey here [1; 5]
provides perhaps the simplest step-up in complexity from the random graph model [93; 119]. In
the data generation process, the bit strings are equally probable but the induced probabilities
of observing edges are different. A class of random graphs with such a property has been
recently rediscovered and further explored in the mathematics literature, where the class of
such graphs is referred to as inhomogeneous random graphs [45]. An alternative and arguably
more interesting set of specifications can be obtained by imposing dependence among the
bits at each node. This can be accomplished by sampling sets of dependent probabilities
from a family of distributions on the unit hypercube, $p_n \in [0,1]^K$, and then sampling the
bits independently given these dependent probabilities.

1. Sample node-specific $K$-bit binary strings for each node $n \in \mathcal{N}$,

$$p_n \sim \mathrm{hypercube}(\mu, \sigma, \alpha), \quad \text{where } \sigma > (K-1) \cdot \alpha > 0,$$
$$b_{nk} \sim \mathrm{Bern}(p_{nk}), \quad \text{for } k = 1, \ldots, K,$$

2. Sample directed edges for all node pairs $(n,m) \in \mathcal{N} \times \mathcal{N}$,

$$Y_{nm} \sim \mathrm{Bern}(q(b_n, b_m)).$$

In the hypercube distribution 3 , $\mu, \sigma, \alpha$ control the frequency, variability and correlation of
the bits within a string, respectively; and $q$ maps pairs of binary strings into the unit interval.

In the exchangeable graph model, the number of bits, $K$, captures the complexity of
the graph. For instance, for $K < N$ the model provides a compression of the graph. For
directed graphs the function $q$ is asymmetric in the arguments. The sparsity of the bit strings
is controlled by the parameter $\alpha > 0$. A larger value of $\alpha$ leads to larger negative correlation
among the bits and thereby a sparser network. In such an exchangeable graph model there
are two main sources of variability: (i) the probability of an edge decreases with the number
of bits $K$, as more complexity reduces the chances of an edge, and (ii) the probability of an
edge increases with $1/\alpha$, as concentrating density in the corners of the unit $K$-hypercube
improves the chances of an edge. While this model does not quite fit the definition of
non-homogeneous models of Bollobas et al. [45], it is tractable enough to allow the analysis of
the giant component in $(K, \alpha)$ space, by leveraging the branching process strategy developed
by Durrett [91] (see the appendix at the end of the chapter). As in Durrett's analysis, the
giant component emerges because a number of smaller components must intersect with high
probability. In exchangeable graph models, however, the giant component has a peculiar
structure; connected components are themselves connected to form the giant component as
soon as bit strings that match on two bits appear with high probability. Figure 3.1 provides
a graphical illustration of this intuition. Nodes that bridge two connected components are
evident in the left panel. Note that there are no nodes that bridge three components, as bit
strings matching on three bits are unlikely in a graph with 100 nodes.

3 The hypercube distribution can be obtained using a hierarchical construction as follows. Sample
$u \sim \mathrm{Normal}_k(\mu, \Sigma)$, where $u \in \mathbb{R}^k$ and $\Sigma_{ii} = \alpha$, $\Sigma_{ij} = \beta$ for $i \neq j$. Then define
$p_i = (1 + e^{-u_i})^{-1}$ for $i = 1, \ldots, k$. The resulting density for $p$, where $p \in [0,1]^k$, is

$$f_p(p \mid \mu, \alpha, \beta) = \frac{|2\pi\Sigma|^{-1/2}}{\prod_{j=1}^{k} p_j (1 - p_j)} \exp\left\{ -\frac{1}{2} \left( \log\frac{p}{1-p} - \mu \right)' \Sigma^{-1} \left( \log\frac{p}{1-p} - \mu \right) \right\}.$$

For more details see [4].

Figure 3.1: Left panel. An example adjacency matrix that corresponds to a fully connected
component among 100 nodes. Right panel. The clustering coefficient as a function of $\alpha$ on a
sequence of graphs with 100 nodes. Here $\sigma = 12$, and $\log(\mu_i) = 4$ for every $i = 1, \ldots, K$.

Given a graph, we can infer the corresponding set of binary strings from data. The
likelihood that corresponds to an exchangeable graph model is simple to write,

$$\ell(Y \mid \theta) = \int db_{1:N} \left( \prod_{n,m} \Pr(Y_{nm} \mid b_n, b_m, q) \prod_{n} \Pr(b_n \mid \theta) \right),$$

where $\theta = (\mu, \sigma, \alpha)$ or an appropriate set of parameters. We can apply standard inference
techniques [2; 9]. Fitting an exchangeable graph model allows us to assess the complexity
of an observed graph, leveraging notions from information theory. For instance, we can
use the minimum description length (MDL) principle to decide how many bits we need to
explain the observed connectivity patterns with high probability. We can also quantify how
much information is retained at different bit-lengths, and plot the corresponding information
profile for $K < N$ and an entropy histogram for any given value of $K$.

The exchangeable graph model allows for algorithmic comparison of any set of statistical 
models that are proposed to summarize an observed graph. As an illustration, consider 
an observed graph G and two alternative models A and B. Rather than comparing how 
well models A and B recover the degree distribution of G or other graph statistics, and 
independently of whether it makes sense to directly compare the two likelihoods of A and B 
(in fact, these models need not have a likelihood), we can proceed as follows. 
1. Given a graph $G$, fit models $A(\Theta_a)$ and $B(\Theta_b)$ to obtain estimates of their parameters,
$\Theta_a^{Est}$ and $\Theta_b^{Est}$ respectively.

2. Sample $M$ graphs at random from the support of each of $A(\Theta_a^{Est})$ and $B(\Theta_b^{Est})$.

3. Compute the distributions of summary statistics based on notions from information
theory, such as information profile and entropy histogram, corresponding to the $2M$
graphs sampled from $A$ and $B$.

4. Compare models in terms of the distribution on the statistics above, such as the
complexity of the two models' supports and their similarity to the complexity of $G$.
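
The four steps above amount to a simulation-based comparison loop, sketched below under several assumptions. The information profile and entropy histogram are not defined in this section, so the sketch substitutes a simple stand-in statistic (the Shannon entropy of the degree distribution). The `(fit, sample)` interface, the two toy models, and the mean-absolute-gap summary are likewise hypothetical choices made only to keep the example self-contained.

```python
import math
import random


def degree_entropy(edges, n):
    """Stand-in summary statistic (an assumption): entropy, in bits, of the degree distribution."""
    deg = [0] * n
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    counts = {}
    for d in deg:
        counts[d] = counts.get(d, 0) + 1
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def compare_models(observed, n, models, statistic, num_samples, rng):
    """Steps 1-4 for models given as name -> (fit, sample) pairs of callables.

    fit(observed, n) returns parameters; sample(params, n, rng) returns an edge list.
    Returns, per model, the mean absolute gap between sampled and observed statistic."""
    target = statistic(observed, n)
    gaps = {}
    for name, (fit, sample) in models.items():
        params = fit(observed, n)                                       # step 1
        draws = [sample(params, n, rng) for _ in range(num_samples)]    # step 2
        stats = [statistic(g, n) for g in draws]                        # step 3
        gaps[name] = sum(abs(s - target) for s in stats) / num_samples  # step 4
    return gaps


# Two toy models: G(N, p) with fitted p, and G(N, E) with the observed edge count.
def fit_gnp(edges, n):
    return len(edges) / math.comb(n, 2)


def sample_gnp(p, n, rng):
    return [(i, j) for i in range(n) for j in range(i + 1, n) if rng.random() < p]


def fit_gne(edges, n):
    return len(edges)


def sample_gne(e, n, rng):
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return rng.sample(pairs, e)


rng = random.Random(0)
observed = sample_gnp(0.1, 40, rng)
gaps = compare_models(observed, 40,
                      {"A": (fit_gnp, sample_gnp), "B": (fit_gne, sample_gne)},
                      degree_entropy, num_samples=20, rng=rng)
```

Because the comparison only needs each model to be fitted and sampled from, it applies even when a model has no tractable likelihood.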

The exchangeable graph model also allows for evaluation of the distribution of the number
of bit strings with $l$ matching bits, for any integer $l < K$. In theory this distribution leads to
expectations on the number of nodes that bridge $l$ communities, where the members of each
community have only one out of $l$ matching bits. In practice, we may want to specify $K$ in
advance so that each bit corresponds to a well-defined property. For instance, in applications
to biology, nodes may correspond to proteins and the $K$ bits encode presence or absence of
specific protein domains. The distribution on the number of $l$-matchings leads to p-values
that summarize how unexpected it is to observe binding events among a set of proteins that
share a certain combination of domains.

Overall, the exchangeable graph model introduces weak dependences among the edges 
of a random graph in a controlled fashion, which ultimately lead to a range of more struc- 
tured connectivity patterns and enable model comparison strategies rooted in notions from 
information theory. The focus here is not on modeling per se. In fact, the model is kept 
as simple as possible. Rather, the focus is on modeling as a means to establish a technical 
link between graph connectivity and node attributes. This technical link is useful to address 
some of the issues listed in Chapter 5. For more details see [5]. 

There exist other complex graph models in the network analysis literature that induce
exchangeable or partially exchangeable edges. We will discuss latent space models [146;
137] and stochastic blockmodels [236; 7; 9] as examples. These models can all be traced
back to an original analysis of multivariate sociometric relations (measurements of relations
represented as vectors rather than scalars) that was developed a few decades ago [103]. The
difference between these models and the exchangeable graph model lies in the interpretation
of the latent variables and in the goal of the analysis. Latent space models interpret the
latent variables as latent positions in a social space, and blockmodels interpret the latent
membership vectors in terms of functional association or community membership. In the
exchangeable graph model, the latent binary strings do not carry semantic meaning; rather,
they are mathematical artifacts that help to represent a graph and induce an expressive
parametric family of distributions [15; 165; 5]. Most importantly, the exchangeable graph
model is meant to be a tool to represent and explore the space of connectivity patterns in
a smooth, principled semi-parametric fashion. In this regard, exchangeable graph models
differ substantially from latent space models or stochastic blockmodels.
3.4 The $p_1$ Model for Social Networks

A conceptually separate thread of research developed in parallel in the statistics and social
sciences literature, starting with the introduction of the $p_1$ model. Consider a directed graph
on the set of $n$ nodes. Holland and Leinhardt's $p_1$ model focuses on dyadic pairings and
keeps track of whether node $i$ links to $j$, $j$ to $i$, neither, or both. It contains the following
parameters:

• $\theta$: a base rate for edge propagation,

• $\alpha_i$ (expansiveness): the effect of an outgoing edge from $i$,

• $\beta_j$ (popularity): the effect of an incoming edge into $j$,

• $\rho_{ij}$ (reciprocation/mutuality): the added effect of reciprocated edges.

Let $P_{ij}(0,0)$ be the probability for the absence of an edge between $i$ and $j$, $P_{ij}(1,0)$ the
probability of $i$ linking to $j$ ("1" indicates the outgoing node of the edge), and $P_{ij}(1,1)$ the
probability of $i$ linking to $j$ and $j$ linking to $i$. The $p_1$ model posits the following probabilities
(see [149]):

$$
\begin{aligned}
\log P_{ij}(0,0) &= \lambda_{ij}, && (3.1) \\
\log P_{ij}(1,0) &= \lambda_{ij} + \alpha_i + \beta_j + \theta, && (3.2) \\
\log P_{ij}(0,1) &= \lambda_{ij} + \alpha_j + \beta_i + \theta, && (3.3) \\
\log P_{ij}(1,1) &= \lambda_{ij} + \alpha_i + \beta_j + \alpha_j + \beta_i + 2\theta + \rho_{ij}. && (3.4)
\end{aligned}
$$

In this representation of $p_1$, $\lambda_{ij}$ is a normalizing constant to ensure that the probabilities
for each dyad $(i,j)$ add to 1. For our present purposes, assume that the dyad is in one
and only one of the four possible states. The reciprocation effect, $\rho_{ij}$, implies that the odds
of observing a mutual dyad, with an edge from node $i$ to node $j$ and one from $j$ to $i$, is
enhanced by a factor of $\exp(\rho_{ij})$ over and above what we would expect if the edges occurred
independently of one another.

The problem with this general $p_1$ representation is that there is a lack of identification of
the reciprocation parameters. The following special cases of $p_1$ are identifiable and of special
interest:

1. $\alpha_i = 0$, $\beta_j = 0$, and $\rho_{ij} = 0$. This is basically an Erdos-Renyi-Gilbert model for
directed graphs: each directed edge has the same probability of appearance.

2. $\rho_{ij} = 0$, no reciprocal effect. This model effectively focuses solely on the degree
distributions into and out of nodes.

3. $\rho_{ij} = \rho$, constant reciprocation. This was the version of $p_1$ studied in depth by Holland
and Leinhardt using maximum likelihood estimation.

4. $\rho_{ij} = \rho + \rho_i + \rho_j$, edge-dependent reciprocation. Fienberg and Wasserman [101, 102]
described this model and how to find maximum likelihood estimates for the parameters.


In the constant reciprocation setting, the elevated probability of reciprocal edges does not
depend on the dyad, whereas edge-dependent reciprocation dictates multiplicative increases
of the reciprocation probability based on node-specific parameters.
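
The role of the normalizing constant and of the reciprocation parameter can be checked numerically. The sketch below assumes the standard Holland-Leinhardt log-linear form, in which the log-probability of $i \to j$ is $\lambda_{ij} + \theta + \alpha_i + \beta_j$, that of $j \to i$ is $\lambda_{ij} + \theta + \alpha_j + \beta_i$, and the mutual state adds both sets of terms plus $\rho_{ij}$. The function name and the parameter values are hypothetical.

```python
import math


def p1_dyad_probs(theta, alpha_i, beta_i, alpha_j, beta_j, rho_ij):
    """Return {(y_ij, y_ji): probability} for a single dyad under the p1 model."""
    log_weights = {
        (0, 0): 0.0,
        (1, 0): theta + alpha_i + beta_j,
        (0, 1): theta + alpha_j + beta_i,
        (1, 1): 2 * theta + alpha_i + beta_j + alpha_j + beta_i + rho_ij,
    }
    # exp(lambda_ij) is the reciprocal of the sum of the unnormalized weights,
    # which makes the four dyad probabilities add to 1.
    z = sum(math.exp(w) for w in log_weights.values())
    return {state: math.exp(w) / z for state, w in log_weights.items()}


probs = p1_dyad_probs(theta=-1.0, alpha_i=0.3, beta_i=-0.2,
                      alpha_j=0.1, beta_j=0.4, rho_ij=1.5)

# Cross-product ratio of the 2 x 2 dyad table; it equals exp(rho_ij).
odds_ratio = probs[(1, 1)] * probs[(0, 0)] / (probs[(1, 0)] * probs[(0, 1)])
```

Setting `rho_ij=0.0` makes the cross-product ratio equal to 1, which corresponds to the two directed edges of the dyad occurring independently.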

The likelihood function for the $p_1$ model is clearly in exponential family form. For the
constant reciprocation version, we have

$$\log \Pr\nolimits_{p_1}(y) \propto y_{++}\theta + \sum_i y_{i+}\alpha_i + \sum_j y_{+j}\beta_j + \sum_{ij} y_{ij}y_{ji}\rho, \tag{3.5}$$

where a "+" denotes summing over the corresponding subscript. The minimal sufficient
statistics (MSSs) are $y_{i+}$, $y_{+j}$, and $\sum_{ij} y_{ij}y_{ji}$. Then, using the usual exponential family
theory, we know that the likelihood equations are found by setting the MSSs equal to their
expectations (cf. [308]). Holland and Leinhardt gave an explicit iterative algorithm for
solving these equations with the added constraints that the probabilities for each dyad add
to 1.

A major problem with the $p_1$ and related models, recognized by Holland and Leinhardt,
is the lack of standard asymptotics to assist in the development of goodness-of-fit procedures
for the model. Since the numbers of $\{\alpha_i\}$ and $\{\beta_j\}$ increase directly with the number of nodes,
we have no consistency results for the maximum likelihood estimates, and no simple way
to test for $\rho = 0$, for example. A few ad hoc fixes have been suggested in the literature, the
most direct of which deals with the problem by setting subsets of the $\{\alpha_i\}$ and $\{\beta_j\}$ equal
to one another (see the discussion of blockmodels below) or by considering them as arising
from common prior distributions (see, e.g., [311]). Fienberg et al. [104] recently suggested
the use of tools from algebraic statistics to find Markov basis generators for the model and
the conditional distribution of the data given the MSSs.

Fienberg and Wasserman proposed a slightly different dyad-based data representation
for the $p_1$ model. Conceptually, the dyad considers the two directed measurements together:
$\{D_{ij} = (y_{ij}, y_{ji})\}$. In their work, they define

$$x_{ijkl} = \begin{cases} 1 & \text{if } D_{ij} = (y_{ij}, y_{ji}) = (k,l), \\ 0 & \text{otherwise,} \end{cases}$$

where $k$ and $l$ take the values of 1 or 0. This representation converts the dyad
$\{D_{ij} = (y_{ij}, y_{ji})\}$ into a $2 \times 2$ table with exactly one entry of 1 and the rest 0. Now if we
collect the data for the $n(n-1)/2$ dyads together, they form an $n \times n \times 2 \times 2$ incomplete
contingency table with "structural" zeros down the diagonal of the $n \times n$ marginal (i.e., no
self loops), and "duplicate" data for each dyad above and below the diagonal. In this redundant
4-way table, the model of no second-order interaction corresponds to $p_1$ with constant
reciprocation, and the standard iterative proportional fitting algorithm⁴ can be used to compute
the maximum likelihood estimates. Fienberg et al. [103] show that the same type of contingency
table representation also works for the correlated $p_1$ model for multiple relations, and Meyer [213]
provides a technical statistical rationale for these contingency table representations.

4 For details on IPF for contingency tables, see [39; 99].
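To make the dyad-based representation concrete, the following sketch builds the redundant $n \times n \times 2 \times 2$ indicator table from an adjacency matrix and recovers the minimal sufficient statistics from it. It assumes a directed binary graph stored as a list of lists; the helper names `dyad_table` and `sufficient_stats` are hypothetical, and the block does not implement iterative proportional fitting itself.

```python
def dyad_table(y):
    """Return x[i][j][k][l] = 1 if (y_ij, y_ji) == (k, l), for i != j."""
    n = len(y)
    x = [[[[0, 0], [0, 0]] for _ in range(n)] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:                       # structural zeros on the diagonal
                x[i][j][y[i][j]][y[j][i]] = 1
    return x

def sufficient_stats(y):
    """Out-degrees y_{i+}, in-degrees y_{+j}, and the number of mutual dyads."""
    n = len(y)
    out_deg = [sum(row) for row in y]
    in_deg = [sum(y[i][j] for i in range(n)) for j in range(n)]
    mutual = sum(y[i][j] * y[j][i] for i in range(n) for j in range(i + 1, n))
    return out_deg, in_deg, mutual

# Three nodes with edges 0 -> 1, 0 -> 2, and 1 -> 0 (one mutual dyad).
y = [[0, 1, 1],
     [1, 0, 0],
     [0, 0, 0]]
x = dyad_table(y)
out_deg, in_deg, mutual = sufficient_stats(y)
```

Because every dyad appears twice in the table, once as $(i,j)$ and once as $(j,i)$, the $(1,1)$ cells sum to twice the number of mutual dyads.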

Holland and Leinhardt analyzed Sampson's monk dataset (cf. subsection 2.2.1 and [259])
using the $p_1$ model. Fienberg et al. [103] analyzed an 8-relation version of the Sampson data
(4 positive and 4 negative) using their multiple-relation generalizations of $p_1$, but focusing
on an aggregation of the 18 monks into the three blocks identified in [322]: a top-esteemed
block of 7 monks with an unambivalently positive attitude towards itself, in conflict with a
more ambivalent block of 7, and a block of 4 outcasts and waverers.

3.5 p2 Models for Social Networks and Their Bayesian Relatives

In the statistical literature, the notion of fixed effects typically refers to a set of unknown
constant quantities, each of which is used to partly explain the variability of the observations
corresponding to a unit of analysis, e.g., an individual or a pair of individuals. This contrasts
with the notion of random effects, which refers to a set of unknown variable quantities that
serve a similar purpose and are drawn from the same underlying distribution.

The $p_1$ model treats expansiveness, $\{\alpha_i\}$, and popularity, $\{\beta_j\}$, as fixed effects associated
with unique nodes in the network. Often it makes more sense to think about the ensemble
of expansiveness and/or popularity effects as a sample drawn from some underlying
distribution, and then estimate the parameters of that distribution. This type of random effects
network model has been developed in a series of papers by Snijders and his collaborators
and they refer to it as the $p_2$ network model, e.g., see van Duijn et al. [301]. It is reasonably
straightforward to take any of the multivariate variations on $p_1$ and generate a family of
multi-level models with mixtures of fixed and random effects in the spirit of $p_2$, e.g., see
Zijlstra et al. [333].

Bayesian extensions of frequentist approaches often involve positing a statistical model
for fixed effects, thus converting them into random effects. The principal distinction between
the $p_2$ models and Bayesian extensions of $p_1$ is that, in the latter, the other unknown constant
quantities, $\lambda$, $\theta$, $\rho$, may also be converted into random effects. Furthermore, there may be
additional levels to the multilevel hierarchy in these models, and there are prior distributions
on the parameters at the highest level of the hierarchy (cf. Gill and Swartz [121]; Wang and
Wong [311]). It should come as no surprise that authors using the Bayesian approach have
worked with Markov chain Monte Carlo (MCMC) methods, as have those using versions of
$p_2$.

MCMC implementations of $p_2$ models in STOCNET⁵ are well-suited for networks with a
relatively large number of nodes, e.g., Zijlstra et al. [333] study network data from 20 Dutch
high schools with a total of 1,232 pupils.

5 STOCNET is a freestanding software package for the statistical analysis of social networks, available at
http://stat.gamma.rug.nl/stocnet/.

3.6 Exponential Random Graph Models



Under the assumption that two possible edges are dependent only if they share a common
node,⁶ Frank and Strauss [110] proved the following characterization for the probability
distribution of undirected Markov graphs:

$$\Pr\{Y = y\} = \exp\Big(\sum_{k=1}^{n-1}\theta_k S_k(y) + \tau T(y) - \psi(\theta,\tau)\Big), \quad y \in \mathcal{Y}, \tag{3.6}$$

where $\theta := \{\theta_k\}$ and $\tau$ are parameters, $\psi(\theta,\tau)$ is the normalizing constant, and the statistics
$S_k$ and $T$ are counts of specific structures such as edges, triangles, and $k$-stars:

number of edges: $S_1(y) = \sum_{1 \le i < j \le n} y_{ij}$,

number of $k$-stars ($k \ge 2$): $S_k(y) = \sum_{1 \le i \le n} \binom{y_{i+}}{k}$,

number of triangles: $T(y) = \sum_{1 \le i < j < h \le n} y_{ij}\, y_{ih}\, y_{jh}$.

Note that there is a dependence structure to the parameters of this model, with edges being
contained in 2-stars, and 2-stars being contained in both triangles and three-stars. Certain
variations of this ERGM that involve directed edges are natural generalizations of the
$p_1$ model. Alternative parameterizations that go beyond Markov graph models have been
recently proposed, e.g., see [280; 317; 21].
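The Markov graph statistics are simple functions of the adjacency matrix. As a minimal sketch, assuming an undirected graph stored as a symmetric 0/1 list of lists with a zero diagonal, they can be computed as follows; the function name `markov_graph_stats` and its `max_star` argument are hypothetical.

```python
from itertools import combinations
from math import comb

def markov_graph_stats(y, max_star=3):
    """Edges, k-stars (2 <= k <= max_star), and triangles of an undirected graph."""
    n = len(y)
    degrees = [sum(row) for row in y]                       # y_{i+}
    edges = sum(y[i][j] for i, j in combinations(range(n), 2))
    # A node of degree d is the centre of C(d, k) distinct k-stars.
    stars = {k: sum(comb(d, k) for d in degrees) for k in range(2, max_star + 1)}
    triangles = sum(y[i][j] * y[i][h] * y[j][h]
                    for i, j, h in combinations(range(n), 3))
    return edges, stars, triangles

# A single triangle: 3 edges, three 2-stars, no 3-stars, 1 triangle.
triangle_graph = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
edges, stars, triangles = markov_graph_stats(triangle_graph)
```

The example makes the nesting visible: the one triangle already contains all three edges and all three 2-stars, which is why the corresponding parameters cannot be interpreted independently of one another.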

Frank and Strauss [110] worked mainly with the three parameter model where
$\theta_3 = \cdots = \theta_{n-1} = 0$. They proposed a pseudo-likelihood parameter estimation method [287]
that maximizes

$$\ell(\theta, \tau) = \sum_{i<j} \log\Big(\Pr\nolimits_{\theta,\tau}\{Y_{ij} = y_{ij} \mid Y_{uv} = y_{uv} \text{ for all } u < v,\ (u,v) \neq (i,j)\}\Big).$$
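For the three parameter model, each conditional probability in the pseudo-likelihood is logistic in the change statistics, i.e., the difference in (edges, 2-stars, triangles) between the graph with and without the edge in question. The sketch below evaluates the log pseudo-likelihood in that way; it is an illustration under these assumptions and not the estimation code of [110] or [287], and the function names are hypothetical.

```python
from itertools import combinations
from math import comb, exp, log

def stats(y):
    """(edges, 2-stars, triangles) for a symmetric 0/1 adjacency matrix."""
    n = len(y)
    edges = sum(y[i][j] for i, j in combinations(range(n), 2))
    two_stars = sum(comb(sum(row), 2) for row in y)
    triangles = sum(y[i][j] * y[i][h] * y[j][h]
                    for i, j, h in combinations(range(n), 3))
    return (edges, two_stars, triangles)

def log_pseudolikelihood(y, params):
    """Sum over pairs i<j of log Pr(Y_ij = y_ij | rest), params = (theta1, theta2, tau)."""
    n = len(y)
    total = 0.0
    for i, j in combinations(range(n), 2):
        observed = y[i][j]
        # Change statistics: u(y with edge ij) - u(y without edge ij).
        y[i][j] = y[j][i] = 1
        with_edge = stats(y)
        y[i][j] = y[j][i] = 0
        without_edge = stats(y)
        y[i][j] = y[j][i] = observed             # restore the observed value
        logit = sum(p * (a - b) for p, a, b in zip(params, with_edge, without_edge))
        p_edge = 1.0 / (1.0 + exp(-logit))
        total += log(p_edge) if observed else log(1.0 - p_edge)
    return total

path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]         # path graph on three nodes
value_at_zero = log_pseudolikelihood(path, (0.0, 0.0, 0.0))
```

When only the edge parameter is nonzero the dyads really are independent, and the pseudo-likelihood coincides with the ordinary Bernoulli likelihood; the two diverge once the 2-star or triangle parameters are active.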

Wasserman and Pattison [316] proposed the current formulation of these Exponential Random
Graph Models (ERGM), also referred to as $p^*$ models, as a generalization of the Markov
graphs of Frank and Strauss. For both directed and undirected graphs, they maintain a
similar characterization of the probabilities where the statistics $S_k$ and $T$ are replaced by
arbitrary statistics $u$. This leads to likelihood functions of the form

$$\Pr\{Y = y\} = \exp\big(\theta^\top u(y) - \psi(\theta)\big). \tag{3.7}$$

The statistics $u(y)$ are counts of graph structures. Although they are not independent
(they count overlapping sets of edges), they are assumed independent in the pseudo-likelihood.
Ignoring these correlations is problematic: it causes extreme sensitivity of the predicted number
of edges to small changes in the value of certain parameters [302]. Park and Newman [240]
formally characterized these sensitivity issues. Snijders et al. [280] recently proposed a variant of
these models where the major problem of double-counting is mitigated but not overcome.
Hunter and Handcock [155] estimate likelihood ratios for nearby $\{\theta_i\}$ using an MCMC
procedure related to the work of Geyer and Thompson [118]. Their estimation procedure can be
used for models based on distributions in the curved exponential family.

6 This is the definition of the Markov property for spatial processes on a lattice in [33].

Robins et al. [256] describe problems associated with the estimation of parameters in 
many ERGMs, involving near degeneracies of the likelihood function and thus of methods 
used to estimate parameters using maximum likelihood. For example, for a certain com- 
bination of ERGM statistics, the likelihood function may have multiple, clearly distinct 
modes, and there are very few network configurations — often radically different from each 
other — that have non-zero probabilities. This is a topic of current theoretical and empirical 
investigation rooted in the theory of discrete exponential families [136; 251]. For a discus- 
sion of mixing times of MCMC methods for ERGMs and the relevance to convergence and 
degeneracies, see [35]. 
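On very small graphs the normalizing constant can be computed by brute force, which makes the multimodality described above easy to inspect. The sketch below enumerates all graphs on $n$ nodes for an edge-triangle model and returns the exact distribution of the edge count. The parameter values in the example are illustrative assumptions, not estimates from any dataset, and the enumeration is only feasible for a handful of nodes.

```python
from itertools import combinations, product
from math import exp

def exact_edge_distribution(n, theta_edges, tau_triangles):
    """Exact distribution of the edge count under an edge-triangle ERGM on n nodes."""
    pairs = list(combinations(range(n), 2))
    triples = list(combinations(range(n), 3))
    weight_by_edges = [0.0] * (len(pairs) + 1)
    for bits in product((0, 1), repeat=len(pairs)):
        y = dict(zip(pairs, bits))
        edges = sum(bits)
        triangles = sum(y[(i, j)] * y[(i, h)] * y[(j, h)] for i, j, h in triples)
        weight_by_edges[edges] += exp(theta_edges * edges + tau_triangles * triangles)
    z = sum(weight_by_edges)                     # the normalizing sum, exp(psi)
    return [w / z for w in weight_by_edges]

# Five nodes (1024 graphs): a negative edge effect offset by a positive triangle effect.
dist = exact_edge_distribution(5, theta_edges=-2.0, tau_triangles=2.0)
```

With these values the edge-count distribution has one peak among very sparse graphs and a second, separate peak at the complete graph, with little mass in between; this is a toy version of the behavior reported for larger models.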

There are two carefully constructed packages of routines that are available for analyzing 
network data using ERGMs: statnet 7 and SIENA 8 . These packages focus on the use of 
MCMC methods for estimating the parameters in ERGMs. 

Remark. It is possible to express the current formulation of exponential random graphs
using the formalism of undirected graphical models and the Hammersley-Clifford theorem
[76; 33]. We can write the likelihood of an arbitrary undirected graph as

$$\Pr(y \mid \theta) = \frac{1}{Z}\prod_{c \in C}\psi_c(y_c \mid \theta_c), \tag{3.8}$$

where $y_c$ denotes the nodes in clique $c$, $\theta_c$ denotes the corresponding set of parameters, $\psi_c$ are
non-normalized potentials over the cliques, and $Z = \sum_y \prod_{c \in C}\psi_c(y_c \mid \theta_c)$ is the normalization
constant. If the likelihood is in the exponential family, then the log potentials are linear in
$\theta_c$ and "features" $u(y_c)$, and we can write:

$$\Pr(y \mid \theta) = \frac{1}{Z}\exp\Big(\sum_{c \in C}\theta_c^\top u(y_c)\Big).$$

Within the exponential family, the advantage is that computing derivatives and likelihood
and deriving the corresponding EM algorithm are feasible, although possibly computationally
expensive, by using variational approximation strategies and Monte Carlo methods. A lot
of methodology on the subject has been developed in the area of machine learning. There,
undirected graphs appear primarily in the context of relational learning and imaging. For an
in-depth discussion on exact and approximation methods and for references see [247; 308].

7 A package written for the R statistical environment described at
http://csde.washington.edu/statnet/. See also the documentation in [138; 157; 224; 129].

8 Simulation Investigation for Empirical Network Analysis, a freestanding package available at
http://stat.gamma.rug.nl/snijders/siena.html.

3.7 Random Graph Models with Fixed Degree Distribution

The Erdos-Renyi-Gilbert random graph model is fully symmetric: the degree of each node
(the number of edges associated with it) follows the same binomial distribution, so the
expected degree is the same for all nodes in the graph. A number of natural extensions of the
Erdos-Renyi-Gilbert model result in varying node degrees. For example,

• the preferential attachment model [26] captures the formation of hubs in a graph (see 
section 4.1); 

• the one-parameter "small-world" model [320] interpolates between an ordered finite- 
dimensional lattice and an Erdos-Renyi-Gilbert random graph in order to produce local 
clustering and triadic closures (see section 4.2). 

Albert and Barabasi [12] describe a number of variants on these themes. Many of the
investigators exploring such models focus on the empirical degree distribution, claiming for
example that it follows a power law in many real-world networks
(cf. [26; 232; 69; 91]). The papers utilizing these "statistical physics" style models often
talk about fixed-degree distributions [e.g., 239], and they either fix the degree-distribution
parameters or compute distributions that are conditional on some function of the degree
distributions or sequences, such as their expectations (cf. [235; 70]). Software is available to
sample from the space of random graphs with a given degree distribution based on Markov
chain Monte Carlo methods [42; 138].
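One common way to move through the space of simple graphs with a fixed degree sequence is a chain of degree-preserving edge swaps. The sketch below is a generic illustration of that idea and is not the algorithm of [42] or the software in [138]; whether such a chain samples uniformly depends on details (proposal symmetry, connectivity of the state space, run length) that are not examined here. Function names are hypothetical.

```python
import random

def edge_swap_sample(edges, n_steps, seed=0):
    """Randomize an undirected simple graph while keeping every node degree fixed."""
    rng = random.Random(seed)
    edge_list = [tuple(sorted(e)) for e in edges]
    edge_set = set(edge_list)
    for _ in range(n_steps):
        i, j = rng.sample(range(len(edge_list)), 2)
        (a, b), (c, d) = edge_list[i], edge_list[j]
        if rng.random() < 0.5:                   # choose one of the two possible rewirings
            c, d = d, c
        new1, new2 = tuple(sorted((a, d))), tuple(sorted((c, b)))
        # Reject moves that would create a self-loop or a duplicate edge.
        if a == d or c == b or new1 in edge_set or new2 in edge_set:
            continue
        edge_set -= {edge_list[i], edge_list[j]}
        edge_set |= {new1, new2}
        edge_list[i], edge_list[j] = new1, new2
    return sorted(edge_set)

def degrees(edges, n):
    """Degree of each of the n nodes."""
    deg = [0] * n
    for a, b in edges:
        deg[a] += 1
        deg[b] += 1
    return deg

ring = [(i, (i + 1) % 6) for i in range(6)]      # 6-cycle: every node has degree 2
sampled = edge_swap_sample(ring, n_steps=200, seed=1)
```

Each accepted swap replaces edges $(a,b)$ and $(c,d)$ with $(a,d)$ and $(c,b)$, so every node loses one edge and gains one, and the degree sequence is unchanged.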

There would appear to be a direct link between these ideas and the representation of
degree distributions in the family of $p_1$ models. In the latter, the $\alpha_i$ and $\beta_i$ parameters
represent the out-degree and in-degree for the $i$th node, and the corresponding sufficient
statistics are the empirical values for these. In the statistical literature there is a long
tradition of looking at distributions conditional on minimal sufficient statistics, and for network
models such a notion was investigated as early as 1975 by Holland and Leinhardt, who looked
at the version of $p_1$ with $\rho = 0$, conditioned on the empirical in-degree and out-degree for
all nodes in the network [147]. This allows for the calculation of an exact distribution that
is independent of the $\{\alpha_i\}$ and $\{\beta_j\}$ by enumerating all possible adjacency matrices in the
reference set with the observed in-degrees and out-degrees. There is the expectation that
such an approach could lead to a uniformly most powerful test for $\rho = 0$, but there is no
theory to support this expectation as of yet. McDonald, Smith and Forster [211] suggest an
iterative approach for such calculations using a Metropolis-Hastings algorithm to generate
from the conditional distribution of the triad census given the in-degrees, the out-degrees and
the number of mutual dyads. In a pair of papers [279; 280], Snijders and colleagues explore
such conditioning for maximum likelihood estimation for exponential random graph models,
largely as a mechanism for avoiding the degeneracies and near degeneracies observed when
unconditional maximum likelihood is used, cf. section 3.6 and [256]. Snijders [274] does
something similar for dynamic models for graphs. Roberts [252] suggests an algorithm for
the conditional distribution of the $p_1$ model where $\rho_{ij} = \rho$ given the full set of minimal
sufficient statistics, but McDonald et al. [211] offer a counterexample and suggest an alteration
of the algorithm to generate the proper exact distribution. Generating such exact
distributions is a very tricky matter in discrete exponential families because of the need to utilize
appropriate Markov bases, either explicitly as in Diaconis and Sturmfels [85] or implicitly.
It is unclear whether the proposals in this literature are in fact reaching all possible tables
associated with the distribution.
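For very small networks the reference set can be enumerated directly. When $\rho = 0$ the probability of a graph depends on it only through the degrees, so the conditional distribution given the in-degrees and out-degrees is uniform over this set, and the exact distribution of any statistic (here the number of mutual dyads) follows by counting. The sketch below is a brute-force illustration with hypothetical function names; it scales as $2^{n(n-1)}$ and is not one of the algorithms of [147], [211], or [252].

```python
from itertools import product

def reference_set(out_deg, in_deg):
    """All directed 0/1 adjacency matrices without self-loops that match the given degrees."""
    n = len(out_deg)
    cells = [(i, j) for i in range(n) for j in range(n) if i != j]
    matches = []
    for bits in product((0, 1), repeat=len(cells)):
        y = [[0] * n for _ in range(n)]
        for (i, j), b in zip(cells, bits):
            y[i][j] = b
        rows = [sum(r) for r in y]
        cols = [sum(y[i][j] for i in range(n)) for j in range(n)]
        if rows == list(out_deg) and cols == list(in_deg):
            matches.append(y)
    return matches

def mutual_count(y):
    """Number of mutual dyads in a directed graph."""
    n = len(y)
    return sum(y[i][j] * y[j][i] for i in range(n) for j in range(i + 1, n))

# Every node sends one edge and receives one edge: only the two directed 3-cycles qualify.
graphs = reference_set(out_deg=[1, 1, 1], in_deg=[1, 1, 1])
mutual_distribution = [mutual_count(g) for g in graphs]
```

Comparing the observed number of mutual dyads with `mutual_distribution` gives an exact conditional test in this toy setting; for realistic network sizes the enumeration has to be replaced by sampling, which is where the Markov basis issues discussed above arise.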

Blitzstein and Diaconis [42] explore different efficient mechanisms for generating random
graphs with fixed degree sequence and explicitly make the link between the "statistical
physics" and "sociological" literatures, whereas the earlier papers by Newman [232] and
Park and Newman [239] reference exponential random graphs but only approach the notion of
fixed degree distributions from a statistical physics perspective, focusing on characteristics of
network ensembles rather than maximum likelihood estimation and assessment of
goodness-of-fit.

3.8 Blockmodels, Stochastic Blockmodels and Community Discovery

A problem that has been a focus of attention for at least 40 years in the network literature
is the search for an "optimal partition" of the nodes into groups or blocks. In the
sociometric literature this was known as blockmodeling. A formalization of networks in terms
of non-stochastic blocks goes back at least as far as Lorrain and White [199]. Their paper
and the discussion of structural equivalence gave rise to innumerable papers in mathematical
sociology (see, e.g., [53]) and algorithmic search strategies for determining blocks (see, e.g.,
[19; 88; 89]). By embedding these ideas within a framework of random graphs, Holland et al.
[150] explained how a special version of $p_1$ could be used to describe a random graph model
with predefined blocks. (See also the related discussion in [103] and [311].)

A true stochastic blockmodel approach, however, involves the discovery of the block
structure as part of the model search strategy [314], and the first attempts at doing this within
the framework of $p_1$ and its exponential family generalizations were due to Nowicki and
Snijders, who focused on technical issues such as non-identifiability in a restricted version of the
blockmodel [277; 236; 237; 79]. A comprehensive statistical treatment of these models was
recently developed for analyzing protein interaction data [7; 8] and then further developed in
the context of social network data [9]. Handcock et al. [137] approach this stochastic
blockmodeling problem through a combination of latent space models and traditional clustering.
We describe some of this work in more detail below.

More recently in the statistical physics and computer science literatures the problem
has gone under the label of detection of community structure, e.g., see [122; 232; 71; 233;
266; 217]. This literature is now voluminous and seemingly unconnected to the statistical
blockmodel work.

The basic idea, in both the model-based and algorithmic approaches as well as the com- 
munity detection literature, is that nodes that are heavily interconnected should form a 
block or community. The nodes are reordered to display the blocks down the diagonal of 
the adjacency matrix representing the network. Moreover, the connections between nodes 
in different blocks appear in much sparser off-diagonal blocks. In model-based approaches, 
the partition of the nodes maximizes a statistical criterion linked to the model, e.g., a like- 
lihood function, whereas most algorithmic solutions maximize ad hoc criteria related to the 
"density" of links within and between blocks. 

More formally, a blockmodel is a model of network data that relies on the intuitive
notion of structural equivalence: two nodes are defined to be structurally equivalent if their
connectivity with similar nodes is similar; this is a "soft" definition.⁹ Following up this idea,
we can imagine collapsing structurally equivalent nodes together to form a super-node, or a
block in the language of blockmodels. Keeping the notion of a block in mind we can now
revisit and sharpen the definition of structurally equivalent nodes: given $N$ nodes and $K$
blocks, let $Y_{N \times N}$ be the adjacency matrix of the graph $G(\mathcal{N}, Y)$; then two nodes $a$ and $b$ are
structurally equivalent, and thus belong to the same block $h$, if their connectivity patterns
$C_a$ and $C_b$ with nodes in other blocks are similar. The equivalence between connectivity
patterns of nodes $a$ and $b$ can be formally stated as follows:

$$C_a = \{\, Y(a, i \in h_k) : \forall\, h_k \neq h \,\} \approx C_b,$$

where the index $i$ runs over the nodes other than $a$, $b$, the index $k$ runs over the blocks other
than $h$, $h_k$ is the set of nodes in block $k$, and $\approx$ quantifies similarity according to a suitable
distance metric. This definition relies on a pre-specified partitioning of the $N$ nodes into $K$
blocks. A blockmodel is useful, for instance, in the analysis of social relations where blocks
may correspond to social factions, as well as in the analysis of protein interactions where
blocks may correspond to stable protein complexes.

Collapsing nodes into blocks by leveraging the notion of structural equivalence above is
a more general task than clustering. Consider, for example, the green nodes 7-9 in the left
panel of Figure 3.2. They are structurally equivalent according to the definition above, as
$C_7 \approx C_8 \approx C_9$, although there are no direct connections among the nodes 7-9 themselves. In
this sense, nodes 7-9 would not represent a tight cluster according to measures of similarity
based on direct connectivity. Blocks that would correspond to clusters can be obtained by
pre-specifying an identity blockmodel, $B = I_K$, in which all off-diagonal blocks equal zero
and all diagonal blocks equal one.

At the technical level, we need two sets of parameters in order to instantiate a blockmodel:
(i) the blockmodel itself, $B$, is a $K \times K$ matrix, in which the $B(g,h)$ entry specifies, for
instance, the average probability that nodes in block $g$ have connections directed to nodes
in block $h$, and (ii) a mapping between nodes and blocks, $\pi_{1:N} =: \Pi$, where the
node-specific array summarizes some notion of membership. Airoldi et al. [9], for instance, specify
the mapping in terms of mixed membership arrays, in which $\pi_n(h)$ specifies the relative
frequency of the interactions of node $n$ (which participates in $2N-2$ interactions in total) in
which it instantiates connectivity patterns that are typical of nodes in block $h$. These two sets
of parameters, $B$ and $\Pi$, are two latent sources of variability that compete to explain the
observed connectivity. However, the blockmodel $B$ explains global asymmetric block
connectivity patterns, while the (mixed) membership mapping $\Pi$ explains node-specific symmetric
connectivity patterns. In this sense, instantiating a blockmodel in terms of $B$ and $\Pi$ does not
introduce any source of non-identifiability beyond the usual multiplicity of parametric
configurations that lead to exactly the same likelihood, well characterized in this model by
Nowicki and Snijders [236].

9 The term stochastic equivalence is often used in place of structural equivalence, e.g., see [315].

Figure 3.2: Left: An example graph. Right: The corresponding blockmodel, where red
nodes have been collapsed into the red block and similarly for the other colors. Note that
this problem is not a typical clustering problem, as the green nodes do not share any direct
connections; each green node, however, has connections directed to blue nodes, and
connections directed from red nodes. In other words, given the partition into monochromatic
blocks, the nodes in the green block share patterns of connectivity to nodes in other blocks.
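Given a hard partition of the nodes, the entries $B(g,h)$ can be estimated by the observed density of edges directed from block $g$ to block $h$. The sketch below assumes a single block label per node rather than mixed membership, and a small hypothetical graph patterned on the description of Figure 3.2 (red to green to blue, with no edges inside a block); the function name is also hypothetical.

```python
def block_densities(y, block_of, n_blocks):
    """Estimate B(g, h) as the fraction of possible g -> h edges that are present."""
    n = len(y)
    present = [[0] * n_blocks for _ in range(n_blocks)]
    possible = [[0] * n_blocks for _ in range(n_blocks)]
    for i in range(n):
        for j in range(n):
            if i != j:
                g, h = block_of[i], block_of[j]
                present[g][h] += y[i][j]
                possible[g][h] += 1
    return [[present[g][h] / possible[g][h] if possible[g][h] else 0.0
             for h in range(n_blocks)] for g in range(n_blocks)]

# Blocks: 0 = red (nodes 0-1), 1 = green (nodes 2-3), 2 = blue (nodes 4-5).
block_of = [0, 0, 1, 1, 2, 2]
allowed = {(0, 1), (1, 2)}                       # red -> green and green -> blue
y = [[1 if i != j and (block_of[i], block_of[j]) in allowed else 0
      for j in range(6)] for i in range(6)]
B_hat = block_densities(y, block_of, 3)
```

The estimated blockmodel has zeros on its diagonal, so none of these blocks is a cluster in the sense of dense internal connectivity, yet the nodes within each block are structurally equivalent.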

As a concrete example, consider the mixed membership stochastic blockmodel (MMB)
introduced by [9]; the data generating process for a graph $G = (\mathcal{N}, Y)$ is the following.

1. For each node $p \in \mathcal{N}$:

1.1 Sample mixed membership $\pi_p \sim \mathrm{Dirichlet}_K(\alpha)$.

2. For each pair of nodes $(p, q) \in \mathcal{N} \times \mathcal{N}$:

2.1 Sample membership indicator, $z_{p \to q} \sim \mathrm{Mult}_K(\pi_p)$.

2.2 Sample membership indicator, $z_{p \leftarrow q} \sim \mathrm{Mult}_K(\pi_q)$.

2.3 Sample interaction, $Y(p, q) \sim \mathrm{Bern}(z_{p \to q}^\top B \, z_{p \leftarrow q})$.
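The generative process above can be simulated directly. The sketch below uses only the Python standard library, drawing each Dirichlet vector by normalizing independent Gamma variates; it skips self-pairs $p = q$, which is an assumption made here for illustration, and the function name `sample_mmb` is hypothetical.

```python
import random

def sample_mmb(n_nodes, alpha, B, seed=0):
    """Draw mixed memberships and a directed binary graph from the MMB process."""
    rng = random.Random(seed)
    k = len(alpha)

    def dirichlet(a):
        # Normalized independent Gamma(a_h, 1) draws are Dirichlet(a) distributed.
        g = [rng.gammavariate(a_h, 1.0) for a_h in a]
        s = sum(g)
        return [v / s for v in g]

    def categorical(p):
        # Index of the single nonzero entry of a one-hot multinomial draw.
        return rng.choices(range(k), weights=p)[0]

    pi = [dirichlet(alpha) for _ in range(n_nodes)]          # step 1.1
    y = [[0] * n_nodes for _ in range(n_nodes)]
    for p in range(n_nodes):
        for q in range(n_nodes):
            if p == q:
                continue                                     # no self-interactions (assumption)
            g = categorical(pi[p])                           # step 2.1: z_{p->q}
            h = categorical(pi[q])                           # step 2.2: z_{p<-q}
            # Step 2.3: with one-hot indicators, z' B z reduces to the entry B[g][h].
            y[p][q] = 1 if rng.random() < B[g][h] else 0
    return pi, y

pi, y = sample_mmb(6, alpha=[0.5, 0.5], B=[[0.9, 0.1], [0.1, 0.9]], seed=3)
```

Small values of `alpha` push each membership vector toward a single block, so the simulated graphs approach those of an ordinary stochastic blockmodel; larger values produce nodes that switch blocks from one interaction to the next.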



Note that the group membership of each node is context dependent. That is, each node
may assume different membership when interacting or being interacted with by different
peers. Statistically, each node is an admixture of group-specific interactions. The two sets of
latent group indicators are denoted by $\{z_{p \to q} : p, q \in \mathcal{N}\} =: Z_{\to}$ and
$\{z_{p \leftarrow q} : p, q \in \mathcal{N}\} =: Z_{\leftarrow}$.
Also note that the pairs of group memberships that underlie interactions need not be equal;
this fact is useful for characterizing asymmetric interaction networks. Equality may be
enforced when modeling symmetric interactions.

Inference in the blockmodel is challenging, as the integrals that need to be solved to
compute the likelihood cannot be evaluated analytically. The likelihood is

$$\ell(Y \mid \alpha, B) = \int_{\Pi}\int_{Z} \Pr(Y \mid Z, B)\,\Pr(Z \mid \Pi)\,\Pr(\Pi \mid \alpha)\; dZ\, d\Pi.$$

While the inner integral is easily solvable,¹⁰ the outer integral is not. Exact inference is thus
not an option. To complicate things, the number of observations scales as the square of
the number of nodes, $O(N^2)$. Sampling algorithms such as Markov chain Monte Carlo are
typically too slow for real-size problems in the natural, social, and computational sciences.
Airoldi et al. [9] suggest a nested variational inference strategy to approximate the posterior
distribution on the latent variables, $(\Pi, Z)$. (Variational methods scale to large problems
without losing much in terms of accuracy [3; 49; 308].)
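To see why the inner integral is tractable while the outer one is not, it may help to work out a single pair $(p,q)$. Summing over the two one-hot indicators, each of which takes one of $K$ values, gives (a worked step under the generative process described earlier, not a formula quoted from [9]):

```latex
\begin{align*}
\Pr\big(Y(p,q) = 1 \mid \pi_p, \pi_q, B\big)
  &= \sum_{g=1}^{K} \sum_{h=1}^{K}
     \Pr(z_{p \to q} = e_g \mid \pi_p)\,
     \Pr(z_{p \leftarrow q} = e_h \mid \pi_q)\,
     e_g^\top B\, e_h \\
  &= \sum_{g=1}^{K} \sum_{h=1}^{K} \pi_p(g)\, B(g,h)\, \pi_q(h)
   \;=\; \pi_p^\top B\, \pi_q ,
\end{align*}
```

where $e_g$ denotes the $g$-th unit vector. Conditional on $\Pi$, the pairs are therefore independent Bernoulli variables with success probabilities $\pi_p^\top B\, \pi_q$. The remaining integral over $\Pi$ does not factor, because each $\pi_p$ appears in the terms of every pair that involves node $p$.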

Bickel and Chen [37], the most recent contribution to this literature, brings new twists 
to the model-based approach of community discovery. They use a blockmodel to formalize a 
given network in terms of its community structure. The main result of this work implies that 
community detection algorithms based on the modularity score of Newman and Girvan [122] 
are (asymptotically) biased. It shows that using modularity scores can lead to the discovery 
of an incorrect community structure even in the favorable case of large graphs, where com- 
munities are substantial in size and composed of many individuals. This work also proves 
that blockmodels and the corresponding likelihood-based algorithms are (asymptotically) 
unbiased and lead to the discovery of the correct community structure. The proof relies on 
the exchangeability results developed in the statistics community [15; 165] applied to paired 
measurements [84]. 

3.9 Latent Space Models 

The intuition at the core of latent space models is that each node $i \in \mathcal{N}$ can be represented
as a point $z_i$ in a "low dimensional" space, say $\mathbb{R}^k$. The existence of an edge in the adjacency
matrix, $Y(i,j) = 1$, is determined by the distance between the corresponding pair of nodes in
the low dimensional space, $d(z_i, z_j)$, and by the values of a number of covariates measured
on each node individually. The latent space model was first introduced by Hoff et al. [146]
with applications to social network analysis, and has been recently extended in a number
of directions to include treatment of transitivity, homophily on node-specific attributes,
clustering, and heterogeneity of nodes [144; 137; 183].

10 The inner integral resolves into a series of sums, each one over the support of an individual $z$ variable.
The support is the same for all such $z$ variables, and it is given by the $K$ unit-vector vertices of the
$K$-dimensional unit hypercube. In other words, the inner integral is a series of sums, each over the same
$K$ elements.

The conditional probability model for the adjacency matrix $Y$ is

$$\Pr(Y \mid Z, X, \Theta) = \prod_{i \neq j} \Pr\big(Y(i,j) \mid Z_i, Z_j, X_{ij}, \Theta\big),$$

where $X$ are covariates, $\Theta$ are parameters, and $Z$ are the positions of nodes in the low
dimensional latent space. Each relationship $Y(i,j)$ is sampled from a Bernoulli distribution
whose natural parameter depends on $Z_i$, $Z_j$, $X_{ij}$ and $\Theta$. In their model, Hoff et al. [146]
generated the paired observations $Y(i,j)$ starting from the relevant pair of node representations,
$(Z_i, Z_j)$, through a distance model, pair specific covariates $X_{ij}$, and parameters $\Theta = (\alpha, \beta)$.
The log-odds ratio is then:

$$\log \frac{\Pr(Y(i,j) = 1)}{1 - \Pr(Y(i,j) = 1)} = \alpha + \beta^\top X_{ij} - |Z_i - Z_j| \equiv \eta_{ij},$$

and the corresponding log likelihood is

$$\log \Pr(Y \mid \eta) = \sum_{i \neq j} \Big( \eta_{ij}\, Y_{ij} - \log\big(1 + e^{\eta_{ij}}\big) \Big).$$
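The log likelihood of the distance model is straightforward to evaluate for given latent positions. The sketch below assumes the linear predictor $\eta_{ij} = \alpha + \beta x_{ij} - \|z_i - z_j\|$ with a single scalar covariate per pair, which is a simplification made here for brevity; the function name is hypothetical.

```python
import math

def latent_space_loglik(y, z, alpha, beta=0.0, x=None):
    """Log-likelihood of a directed binary graph under the latent distance model."""
    n = len(y)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            x_ij = x[i][j] if x is not None else 0.0
            eta = alpha + beta * x_ij - math.dist(z[i], z[j])
            # eta * y - log(1 + exp(eta)), with the log term written in a stable form.
            total += eta * y[i][j] - (max(eta, 0.0) + math.log1p(math.exp(-abs(eta))))
    return total

# A complete directed graph on three nodes, scored under tight and spread-out positions.
full = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
near = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]]
far = [[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]]
ll_near = latent_space_loglik(full, near, alpha=1.0)
ll_far = latent_space_loglik(full, far, alpha=1.0)
```

A densely connected graph is better explained by positions that are close together, so `ll_near` exceeds `ll_far`; an estimation routine (MCMC or otherwise) would search over `z`, `alpha`, and `beta` using this quantity.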

One can easily extend the latent space modeling approach to weighted networks. In the
general case, paired observations $Y$ may be modeled using a generalized linear model that
makes use of $Z$, $X$, and $\Theta$. Following the formalism in [210], a generalized linear model
that generates the observed edge weights can be specified in terms of three quantitative
elements:

i. the error model $\Pr(Y_{ij})$, i.e., the model for the observed edge weights with mean $\mu_{ij}$;

ii. the linear model $\eta_{ij} = \eta_{ij}(\beta, Z_i, Z_j)$;

iii. the link function $g(\mu_{ij}) = \eta_{ij}$, which maps the support of $\mu_{ij}$ to that of $\eta_{ij}$, typically
$\mathbb{R}$.

For example, in the binary graph, the error model is $\Pr(Y_{ij}) = \mathrm{Bern}(\mu_{ij})$, where $\mu_{ij} \in [0, 1]$
for all node pairs $i, j \in \mathcal{N}$; the linear model is $\eta_{ij} = \beta + d(Z_i, Z_j)$; the link function is
$g(\mu_{ij}) = \log\big(\mu_{ij} / (1 - \mu_{ij})\big)$, with its inverse being $\mu_{ij} = 1 / (1 + e^{-\eta_{ij}})$ [146]. In a graph
with non-negative, integer edge weights, we can posit $\Pr(Y_{ij}) = \mathrm{Poi}(\mu_{ij})$, where $\mu_{ij} \in \mathbb{R}_+$ for
all node pairs $i, j \in \mathcal{N}$; the linear model is $\eta_{ij} = \beta + d(Z_i, Z_j)$, the same as in the previous
example; the link function is $g(\mu_{ij}) = \log(\mu_{ij})$, and its inverse is $\mu_{ij} = e^{\eta_{ij}}$.

In the general case, the generalized linear model for $\eta_{ij}$ may also include an explicit
distance model $d$ in the latent space $Z$:

$$\eta_{ij} = \eta_{ij}(\beta, Z_i, Z_j) = \eta_{ij}\big(\beta, d(Z_i, Z_j)\big).$$

Note that it is possible to re-parametrize $Z_i = \rho_i \omega_i$ to separate the position in a latent
reference space, $\Omega$, from its magnitude, $\rho_i$, a scalar. It is a simple intuition that suggests the
use of an explicit distance model in the latent space. In a binary graph, for example, edges
are more likely to be generated between pairs of nodes whose representations in the latent
space are close. A popular choice of distance measure is Euclidean distance. Estimation
can be done via MCMC sampling.

Inference in latent space models has been carried out via Markov chain Monte Carlo in
networks with up to several thousand nodes [130]. Scalability issues remain to be addressed
before larger networks can be analyzed.

3.9.1 Comparison with Stochastic Blockmodels 

The latent space model of Hoff et al. [146] projects nodes onto a latent Euclidean space by
inverting the logistic link. While in practice there is often interest in identifying groups of
similar nodes, e.g. individuals or proteins, there is no explicit clustering model in the latent
space. To identify groups of similar nodes, clustering methods must be used to analyze the
set of latent positions inferred by the latent space model. To allow joint inference on latent
positions and clusters, Handcock et al. [137] introduce an explicit clustering model in the
latent space in the form of a mixture of (spherical) Gaussians:

$$\Pr(Y \mid Z, X, \Theta) = \prod_{i \neq j} \Pr\big(Y(i,j) \mid Z_i, Z_j, X_{ij}, \Theta\big), \qquad Z_i \sim \sum_k \lambda_k\, N(\mu_k, \sigma_k^2 I).$$

This model combines the original latent space model [146] with a finite mixture of Gaussians
approach to clustering [297; 205]. It posits that the latent positions $Z_i \in \mathbb{R}^d$ come from a
finite mixture model whose components are indexed by $k$, with mixture weights $\lambda_k$ that sum to one.
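A simulation sketch may help fix ideas about the latent position cluster model. It draws each position from a mixture of spherical Gaussians and then generates edges through the logistic distance model with no covariates. The mixture weights, means, and standard deviations in the example are illustrative assumptions, and the function names are hypothetical; this is not the fitting procedure of [137].

```python
import math
import random

def sample_cluster_positions(n_nodes, weights, means, sigmas, seed=0):
    """Draw a cluster label and a latent position for every node."""
    rng = random.Random(seed)
    labels, positions = [], []
    for _ in range(n_nodes):
        k = rng.choices(range(len(weights)), weights=weights)[0]
        labels.append(k)
        # Spherical Gaussian: independent coordinates with a common standard deviation.
        positions.append([rng.gauss(m, sigmas[k]) for m in means[k]])
    return labels, positions

def sample_graph(positions, alpha, seed=1):
    """Directed binary graph with log-odds alpha minus the Euclidean distance."""
    rng = random.Random(seed)
    n = len(positions)
    y = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                eta = alpha - math.dist(positions[i], positions[j])
                y[i][j] = 1 if rng.random() < 1.0 / (1.0 + math.exp(-eta)) else 0
    return y

labels, positions = sample_cluster_positions(
    8, weights=[0.5, 0.5], means=[[0.0, 0.0], [4.0, 4.0]], sigmas=[0.5, 0.5], seed=2)
graph = sample_graph(positions, alpha=1.0)
```

Each node receives exactly one label here, which is the feature that distinguishes this model from mixed membership approaches.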

This extension is related to the stochastic blockmodel of [9], which posits a latent mem- 
bership vector for each node. These vectors can be viewed as cluster assignment probabilities 
for each node. The observed binary relationships between nodes are mediated by per-pair 
latent variables, each drawn conditioned on a node's mixed membership vector. In its gen- 
eral form, the blockmodel allows for multiple relations and covariates. Similarly, the model 
in [137] is also a hierarchical model, as a Gaussian distribution is placed on the latent positions
$Z_i$. In contrast, however, each node belongs to a single cluster and the corresponding
partition governs the observed relationships. There can be variance in the latent position 
variables, but the idea of belonging to two or more groups cannot be represented. Posterior 
uncertainty about cluster membership is different from having an explicit distribution that 
controls mixed membership, which carries with it an additional level of uncertainty. With 
that said, the latent space in which nodes are projected in [137] is somewhat comparable to 
the space of cluster proportions in [9]. The former maps nodes to a Euclidean space, while 
the latter maps nodes to the simplex. 

Both models share the same goal: inferring latent structure that explains the variability 
of the connectivity in an observed network. In the mixed membership model, full MCMC 
for any but the simplest problems is unreasonably expensive. Airoldi et al. [9] appeal to
variational methods for a computationally efficient approximation to the posterior. These
methods can scale to large matrices (e.g., millions of nodes) because of the simplified
approximation, but at an unknown cost to accuracy. It would be interesting to explore computational
tradeoffs for the latent space cluster model [137] as the sample size grows and when large 
numbers of covariates are added. 

Remark. Blei and Fienberg [40] argue that a stochastic blockmodel and node-specific 
mixed membership vectors are two sets of parameters that are directly interpretable in terms 
of notions and concepts relevant to social scientists, and better suited to assist these scientists 
in extracting substantive knowledge from noisy data, to ultimately inform or support the 
development of new hypotheses and theories. 

Applying the mixed membership stochastic blockmodel (MMB) to Sampson's data
demonstrates both similarities and differences [2]. For instance, BIC suggests the existence of three
factions among the 18 monks when fitting the MMB, but the groupings differ from those
found by the latent space cluster model. One major benefit of applying the mixed membership
model to the data is the ability to quantitatively identify two out of three of the novices that
Sampson labeled as waverers in his analysis based on anthropological observations. This
could lead to the formation of a social theory of failure in isolated communities, with a
possibility to be confirmed with real longitudinal data [2].

In the sociology literature, certain specifications of blockmodels are referred to as latent
class models, and certain specifications of latent space models are referred to as latent
distance models. Hoff [145] provides a nice comparison, both theoretical and empirical, of
these two types of models with the eigenmodel. The eigenmodel is based on a singular
value decomposition of the socio-matrix. For a given degree of model complexity (the number
of classes, the number of dimensions in the latent space, or the number of eigenvectors), it
can capture more connectivity patterns than the latent class and the latent distance models.
There is a price to pay, however. The eigenmodel is the least amenable
to interpretation among the three models, as the inferred patterns that capture connectivity
are expressed in terms of eigenvectors. The latent space model can be interpreted in terms of
distances. The latent class models can be interpreted in terms of blocks of connectivity, or
tight micro-communities; this is the easiest model to interpret.

Appendix: Phase Transition Behavior of the
Erdos-Renyi-Gilbert Model

A simple way to analyze the phase transition behavior of Erdos-Renyi-Gilbert models at $\lambda = 1$
is to study the emergence of the giant component as a branching process [91]. Intuitively,
consider branching processes that start at every node: for certain values of $\lambda$ all the branching
processes will keep growing with high probability. Their supports, i.e., the sets of nodes
involved in each process, will intersect with high probability, leading to the emergence of the
giant component, $G$, in which each node can be reached from every other node.

The following formal argument comes from lecture notes by Guetz and Constantine [133] 
based on proofs given by Janson et al. [161]. Pick a node $v \in \mathcal{N}$. If $v$ is connected to all of
the nodes in G, then we say that v is saturated in G. Now work as follows: pick a node v and 
place it on the list. Then, identify all its neighbors in G, and add them to the list. Next, 
take the first unsaturated node on the list and add to the list all of its neighbors which are 
not already in it. The proof is constructed by considering the distribution of the number 
of nodes an unsaturated node adds to the list and by using Chernoff bounds to bound the 
size of the connected component each node belongs to. For details on this proof please see 
[43]. Bollobas et al. [45] carried out an extensive analysis of the phase transition that 
mathematically characterizes emergence of the giant component in inhomogeneous random 
graphs.

Chapter 4

Dynamic Models for Longitudinal 
Data 

In chapter 3 we focused on models for static networks, which consider a cross-section of a real
network at a given point in time. However, real networks often contain a dynamic component.
In the language of networks, dynamics can be translated into the birth and death of edges 
and nodes. For example, in a friendship network, new nodes may be introduced at any time 
and old nodes may drop out due to inactivity; links of friendships and alliances may be 
even more brittle. Dynamic network modeling has been a neglected sibling of static network 
modeling, partly due to the added complexity and partly due to a lack of datasets to study. 
Sampson's monastery study [259] produced one of the earliest datasets with information on 
the dynamics in the network of the 18 initiates. The original research, however, focused on the 
network structure at each given time point, rather than modeling the underlying dynamics 
explicitly. As online communities gain in popularity, we are beginning to get access to an 
increasing number of dynamic network datasets of much larger size and longer time span. At 
the same time, advances in statistical and computational methods for inference and learning 
have enabled development of richer models. Bearing this in mind, in this chapter we consider 
three different classes of models. We begin by revisiting the Erdos-Renyi-Gilbert random 
graph model and its generalizations, viewing them as models for dynamic processes. Then 
we turn to continuous time Markov process models (CMPM) and their discrete time cousins 
(such as a dynamic version of ERGM and other recently proposed models). 

4.1 Random Graphs and the Preferential Attachment 
Model 

Many variations on the classical Erdos-Renyi-Gilbert random graph model in section 3.2 are
typically considered to be static models, in that they model a single, static snapshot of the
network, as opposed to multiple snapshots recorded at different time steps. However, they
also specify processes for link addition and modification, that is, dynamic processes that may
have generated the observed graph, though there is no attempt to fit these dynamic model
properties to observed data. For this reason, we view them as "pseudo-dynamic" models
and discuss three examples here: the Erdos-Renyi-Gilbert model, the preferential attachment
model, and small-world models.

For example, we can view the Erdos-Renyi-Gilbert model $G(N, E)$ itself as a dynamic
process used to generate a random graph:

• start from the graph of $N$ unconnected nodes at time 0;

• at each subsequent time step, add a different edge to the network with probability
$p = E / \binom{N}{2}$.

By convention, we usually fix the number of nodes at $N$, although we can extend the process
to allow for addition of nodes. This model assumes that edges (and nodes) are not removed
once they are added. The degree distribution for $G(N, E)$ is binomial. When $N$ gets large
while $Np$ tends to a constant, it is approximately Poisson. Durrett [91] provides a rich discussion
situating this dynamic description within the tradition of discrete time random walks and
branching processes. In particular, he uses this representation to explore the emergence of
the giant component described in section 3.2 (see appendix of chapter 3).
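
The generative reading of $G(N, E)$ can be made concrete with a short simulation. The sketch below is only an illustration under stated assumptions: we interpret "add a different edge at each time step" as visiting each of the $\binom{N}{2}$ node pairs once and adding it with probability $p$, and the function name `grow_random_graph` and the parameter values are ours, not part of the original model description.

```python
import itertools
import random


def grow_random_graph(n_nodes, expected_edges, seed=0):
    """Visit every node pair once (one pair per 'time step'); add it with probability p."""
    rng = random.Random(seed)
    n_pairs = n_nodes * (n_nodes - 1) // 2
    p = expected_edges / n_pairs  # p = E / C(N, 2)
    degree = [0] * n_nodes
    edges = []
    for i, j in itertools.combinations(range(n_nodes), 2):
        if rng.random() < p:  # edges are only added, never removed
            edges.append((i, j))
            degree[i] += 1
            degree[j] += 1
    return edges, degree, p


edges, degree, p = grow_random_graph(n_nodes=1000, expected_edges=2000)
mean_degree = sum(degree) / len(degree)
var_degree = sum((d - mean_degree) ** 2 for d in degree) / len(degree)
# For an approximately Poisson degree distribution, mean and variance are close.
print(f"p = {p:.5f}, mean degree = {mean_degree:.2f}, variance = {var_degree:.2f}")
```

With these settings the expected degree is $(N - 1)p = 2E/N = 4$, and the sample variance of the degrees should be close to the sample mean, as one would expect under the Poisson approximation.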

The Erdos-Renyi-Gilbert model is simple and easy to study but does not address many 
issues present in real network dynamics. One of the major criticisms [26] of this model 
centers on the fact that it does not produce a scale-free network, i.e., the resulting node
degree distribution does not follow a power law. The network literature is replete with
claims that many real networks exhibit the power-law phenomenon (cf. [12]), and much
subsequent research has focused on how various generalizations of the Erdos-Renyi-Gilbert 
model conform to the power law degree distribution. Molloy and Reed [219] were the first 
to describe how to construct graphs with a general degree distribution and they went on to 
describe the emergence of the giant component in that context as well [220]. 

Barabasi and Albert [26] described a dynamic preferential attachment (PA) model
specifically designed to generate scale-free networks. At time 0, the model starts out with $N_0$
unconnected nodes. At each subsequent time step, a new node is added with $m \le N_0$ edges.
The probability that the new node is connected to an existing node is proportional to the
degree of the latter. In other words, the new node picks $m$ nodes out of the existing network
according to the multinomial distribution

$$ p_i = \frac{\delta_i}{\sum_j \delta_j}, $$

where $\delta_i$ denotes the (undirected) degree of node $i$. This model, which was described much
earlier in the statistical literature by Yule [329] and Simon [269], is intended to describe 
networks that grow from a small nucleus of nodes and follow a "rich-get-richer" scheme. 
The assumption is that, for instance, a new web page will more likely link via a URL to 
a well-known web page as opposed to a little-known one. Mitzenmacher [218] gives a brief 
history of generative models for power law distributions. 
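
A minimal sketch of the "rich-get-richer" growth step follows. It is not the authors' implementation. Two choices are our own assumptions and are marked in the code: while all degrees are zero (the first step, since the seed nodes are unconnected) targets are picked uniformly, and the $m$ targets are made distinct by rejection, which only approximates a multinomial draw. The name `preferential_attachment` and its parameters `n0`, `m`, `n_steps` are hypothetical.

```python
import random


def preferential_attachment(n0, m, n_steps, seed=0):
    """Grow a graph by adding one node with m edges per time step."""
    if not 1 <= m <= n0:
        raise ValueError("need 1 <= m <= n0")
    rng = random.Random(seed)
    degree = [0] * n0
    edges = []
    # Node i appears degree[i] times in this list, so a uniform draw from it
    # selects node i with probability degree[i] / sum of degrees.
    endpoints = []
    for _ in range(n_steps):
        new = len(degree)
        targets = set()
        while len(targets) < m:  # assumption: distinct targets, by rejection
            if endpoints:
                targets.add(rng.choice(endpoints))
            else:
                # Assumption: while every degree is zero, pick uniformly.
                targets.add(rng.randrange(new))
        degree.append(0)
        for old in sorted(targets):
            edges.append((new, old))
            degree[new] += 1
            degree[old] += 1
            endpoints.extend((new, old))
    return edges, degree


edges, degree = preferential_attachment(n0=3, m=2, n_steps=2000)
print("max degree:", max(degree), "median degree:", sorted(degree)[len(degree) // 2])
```

The gap between the maximum and the median degree gives a quick, informal impression of the heavy tail that this mechanism is meant to produce.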

The preferential attachment model of Barabasi and Albert results in a network with
a power law degree distribution whose exponent is empirically determined to be $\gamma_{BA} =
2.9 \pm 0.1$, whereas the Erdos-Renyi-Gilbert model has a Poisson degree distribution. Many
extensions of the model have been proposed that allow for flexible power-law exponents,
edge modifications, non-uniform dependence on the node degree distributions, etc. For
example, Dorogovtsev and Mendes [90] proposed that the probability of creating an edge to node $i$ should
be proportional not just to its degree $\delta_i$ but also to its age, decaying as $(t - t_i)^{-\nu}$, where
$t_i$ is the time at which node $i$ was added and $\nu$ is a tunable parameter. This leads to a
power law degree distribution only if $\nu < 1$.
Barabasi et al. [28] and Durrett [91] provide an account of this and other extensions to the
original model of Albert and Barabasi. Alternative graph generation mechanisms appear
regularly: R-MAT [60], 'winners don't take all' [242], 'forest fire' [194], 'butterfly' [212] and
RTG [10], to name a few. The most recent of these, the RTG model, is shown to conform to 11 empirical laws
observed in real networks. The main goal of these random graph models is to describe
a process that could generate networks emulating certain known network properties. The
generative process could then give an insight into the dynamics that led to the observed
network. But these models are often applied to network data that are gathered at only a few points
in time (sometimes only once). Thus the networks are often examined statically.

It has been recently pointed out that, contrary to previous claims, the empirical laws 
that generative models aim to emulate are not always supported by real data. Visual
comparisons are not sufficient for determining the goodness of fit of a model. For example, a lot
of attention has recently been paid to the degree distribution. Figure 4.1 shows indegree and
outdegree distributions for blog and query databases from an unnamed large company. They 
are plotted on a log-log scale and the downward slopes, if fitted by straight lines, would be 
visually similar to power law distributions with exponents less than 2. A careful examination 
of these plots, however, reveals a curvilinear relationship in all cases, which suggests that 
there is a different generating process than those usually used to justify power laws in
empirical network data. Data such as those displayed in Figure 4.1 are often fitted by ordinary
least squares or even by eye; often the claim is that a degree distribution is scale-free except
for a cutoff at very high or very low degrees, without any adjustment for the search for a
cutoff. There have been a number of recent efforts to assess the fit of degree distributions,
such as those associated with power laws in log-log plots, with more rigor; see, e.g., [74]. As
in this example, results from such careful assessments of fit often contradict the assumption
of linearity.
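
As one illustration of what a more careful fit can look like, the sketch below estimates a power-law exponent by maximum likelihood rather than by a least-squares line on a log-log plot. It uses the standard estimator for a continuous power law above a known lower cutoff, $\hat{\alpha} = 1 + n / \sum_i \ln(x_i / x_{\min})$. This is our own choice of example and is not claimed to be the procedure of [74]; degree data are discrete and the cutoff is rarely known, so the code is only a starting point. The synthetic data and the names `sample_power_law` and `power_law_mle` are assumptions.

```python
import math
import random


def sample_power_law(n, alpha, x_min, seed=0):
    """Inverse-transform sampling from p(x) proportional to x**(-alpha), x >= x_min."""
    rng = random.Random(seed)
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]


def power_law_mle(xs, x_min):
    """Maximum likelihood exponent for a continuous power law above a known x_min."""
    tail = [x for x in xs if x >= x_min]
    log_sum = sum(math.log(x / x_min) for x in tail)
    alpha_hat = 1.0 + len(tail) / log_sum
    std_err = (alpha_hat - 1.0) / math.sqrt(len(tail))  # asymptotic standard error
    return alpha_hat, std_err


xs = sample_power_law(n=20000, alpha=2.5, x_min=1.0)
alpha_hat, std_err = power_law_mle(xs, x_min=1.0)
print(f"alpha_hat = {alpha_hat:.3f} +/- {std_err:.3f}")
```

An estimate of this kind comes with a standard error, which a line fitted by eye does not; a goodness-of-fit test would still be needed before claiming that a power law describes the data.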

Li et al. [196] give a "structural metric" for examining simple connected graphs having 
identical degree distributions and derive theoretical properties of scale-free graphs. They 
provide at least one possible way to assess whether a graph corresponding to a network is 
in fact scale-free. For more informal discussions related to this theoretical work, see [14; 
324]. Flaxman et al. [106; 107] describe a class of network models linked to the preferential 
attachment model that also yield a power-law degree distribution. 

Most descriptions of generative models fall short of studying the full parameter space and
do not propose procedures for fitting the proposed models to real data, though there are a
few works that suggest maximum likelihood, MCMC and other frameworks for fitting these
models to data (e.g., [34; 75; 214; 323]). One of the notable exceptions is work based
on Kronecker graph multiplication. What started as yet another generative procedure [192]
has turned into a well analyzed methodology [195] with an efficient algorithm for model
fitting, analysis of the parameter space, and model selection. This work goes further in
understanding real network structure and provides a way for principled graph sampling.

Figure 4.1: Log-log plots of degree distributions for a query database and a blog database
from a company database. Left: Blog indegree and outdegree distributions. Right: Query
indegree and outdegree distributions. Source: Data from an unnamed large company, stored
in iLab, Carnegie Mellon University.



4.2 Small-World Models

Watts and Strogatz [320] proposed a small-world model which can be thought of as a
"pseudo-dynamic" model in the sense we described in section 4.1. This one-parameter "small-world"
model interpolates between an ordered finite-dimensional lattice and an Erdos-Renyi-Gilbert
random graph in order to produce local clustering and triadic closures. Bollobas and Chung
[44] had previously noted that adding random edges to a ring of $N$ nodes drastically reduces
the diameter of the network. The Watts-Strogatz model begins with a ring lattice with $N$
nodes and $k$ edges per node, and randomly rewires each edge with probability $p$. As $p$ goes
from 0 to 1, the construction moves toward an Erdos-Renyi-Gilbert model. Watts and Strogatz, and others
who followed, studied the behavior of such small-world networks when $0 < p < 1$. This
model is not dynamic, although it is often used to describe networks that evolve over time.
Figure 4.2 shows a small-world graph for $N = 25$ nodes and 2 rewirings per node.
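
The rewiring construction can be sketched in a few lines. The version below is one common reading of the procedure and rests on assumptions of ours: $k$ is even, each lattice edge keeps one endpoint when rewired, and the new endpoint is drawn uniformly among nodes that would create neither a self-loop nor a duplicate edge. The function name `watts_strogatz` is hypothetical.

```python
import random


def watts_strogatz(n, k, p, seed=0):
    """Ring lattice with k neighbors per node; each lattice edge is rewired with probability p."""
    if k % 2 or not 2 <= k < n:
        raise ValueError("k must be even with 2 <= k < n")
    rng = random.Random(seed)
    adj = [set() for _ in range(n)]
    for i in range(n):
        for step in range(1, k // 2 + 1):
            j = (i + step) % n
            adj[i].add(j)
            adj[j].add(i)
    for step in range(1, k // 2 + 1):
        for i in range(n):
            j = (i + step) % n
            if rng.random() >= p:
                continue
            # Keep endpoint i; move the other endpoint to a uniformly chosen node
            # that is neither i itself nor already a neighbor of i.
            candidates = [v for v in range(n) if v != i and v not in adj[i]]
            if candidates:
                new = rng.choice(candidates)
                adj[i].discard(j)
                adj[j].discard(i)
                adj[i].add(new)
                adj[new].add(i)
    return adj


lattice = watts_strogatz(n=25, k=4, p=0.0)
rewired = watts_strogatz(n=25, k=4, p=0.5, seed=1)
print("lattice degrees:", sorted(len(nb) for nb in lattice)[:3], "...")
print("rewired degrees:", sorted(len(nb) for nb in rewired))
```

Setting `p=0.0` returns the ordered lattice, while values near 1 replace most lattice edges by random ones; the total number of edges is the same in both cases.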

Kleinberg [174] introduced a variation on the small-world model where random edges are
added to a fixed grid. Starting with an underlying finite-dimensional grid, he added shortcut
edges, where the probability that two nodes are connected by a long edge depends on the
distance between them in the grid. More precisely, the probability that two non-adjacent
nodes $x$ and $y$ are connected is proportional to $d(x, y)^{-\alpha}$. With $\alpha$ set to the dimension
of the lattice, the greedy routing algorithm can find paths from one node to another in a
polylogarithmic number of expected steps.

Figure 4.2: Small-world graph for $N = 25$ nodes and 2 rewirings per node. The red edges
form the ring lattice and the blue edges the rewiring. This graph was generated using the
Java applet at http://cs.gmu.edu/~astavrou/smallworld.html

Several follow-up works have made adjustments to Kleinberg's rewiring procedure in
an attempt to improve the understanding and efficiency of the navigability of networks. For
example, Clauset and Moore [72] suggested rewiring a long-distance edge from node $x$ if,
during a greedy walk from $x$ to $y$, the original topology of the network did not allow
the walk to reach $y$ within $T_{\mathrm{thresh}}$ steps. The edge is rewired to the place where the search gave
up (the node reached after $T_{\mathrm{thresh}}$ steps of the walk). They show that through this rewiring
procedure the network degree distribution converges to a power law, where $\alpha = \alpha_{\mathrm{rewired}}$.
Their work also studied finite size effects and showed that $\alpha_{\mathrm{opt}} \to d$ as $n \to \infty$, though rather
slowly.

Sandberg [260; 261] and Sandberg and Clarke [262] introduced a different rewiring scheme
with the end goal of making the network more amenable to statistical analysis. Starting with
$N$ nodes on a ring, each with two neighbor links and a long range link, the model of Sandberg
[260] randomly rewires a graph in the following steps:

• at each time step j = 1, 2, 3, ... , choose a random starting node x and a target node 
y and perform greedy routing from x to y; 

• independently and with (small) probability x, update the long-range link of each node 
on the resulting path to point to y.

This defines a Markov chain on a collection of labeled graphs. Sandberg and Clarke [262]
conjecture that when the chain achieves stationarity, the distribution of distances spanned
by long-range links is (close to) the theoretical optimum for search and the expected length of
searches is polylogarithmic. They support the conjecture by a series of simulations. This
methodology has been applied to the study of peer-to-peer (P2P) networks.
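
The two rewiring steps listed above can be sketched as a single update of the Markov chain. This is an illustration under our own assumptions rather than Sandberg's code: the initial long-range links are drawn uniformly at random, the target node itself is never rewired (to avoid a self-loop), and the names `greedy_route`, `rewiring_step` and `p_update` are hypothetical.

```python
import random


def ring_distance(a, b, n):
    d = abs(a - b)
    return min(d, n - d)


def greedy_route(x, y, long_link):
    """Step to whichever of the three links ends closest to y until y is reached."""
    n = len(long_link)
    path = [x]
    while path[-1] != y:
        cur = path[-1]
        options = ((cur - 1) % n, (cur + 1) % n, long_link[cur])
        path.append(min(options, key=lambda v: ring_distance(v, y, n)))
    return path


def rewiring_step(long_link, p_update, rng):
    """One step of the chain: route from a random x to a random y, then rewire."""
    n = len(long_link)
    x, y = rng.randrange(n), rng.randrange(n)
    path = greedy_route(x, y, long_link)
    for node in path[:-1]:  # assumption: the target y itself is not rewired
        if rng.random() < p_update:
            long_link[node] = y
    return len(path) - 1


rng = random.Random(0)
n = 200
long_link = [rng.randrange(n) for _ in range(n)]  # assumption: random initial links
lengths = [rewiring_step(long_link, p_update=0.1, rng=rng) for _ in range(5000)]
print("mean route length, first 500 searches:", sum(lengths[:500]) / 500)
print("mean route length, last 500 searches: ", sum(lengths[-500:]) / 500)
```

Because one of the two ring neighbors is always closer to the target, each greedy step reduces the ring distance by at least one and the search always terminates.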

Durrett [91] discusses links between small-world models and stochastic processes. Typical
uses of small-world models include empirical analyses involving aggregate summary
statistics (see, e.g., [18; 231]). There are as yet no formal statistical methods for examining the
evolution of small-world network models and for assessing their fit to network data measured 
over time. 

4.3 Duplication-Attachment Models

Duplication-Attachment models were originally developed in the computer science theory 
community to study the world wide web as a directed graph [175; 185]. These models aim 
at describing properties of a snapshot of the web graph at a specific time, that is, a static 
directed graph. The data generating process underlying these models, however, is explicitly 
dynamic. The following example demonstrates some basic assumptions behind the dynamics. 
Consider a newly added web page A, which provides a new node in the web graph. The 
creator of web page A will then add hyper-links to it, which provide new directed edges in the 
web graph. In particular, some of these hyper-links will point to other web pages regardless 
of whether their topical content matches the topical content of web page A, but most of these 
hyper-links will point to web pages with a topical content that closely matches the topical 
content of web page A. 

Technically, there are many possible specifications and variants. The basic
duplication-attachment model proposed and analyzed by Kumar et al. [185] is as follows. Denote the
graph at time $t$ as $G_t = (\mathcal{N}_t, \mathcal{E}_t)$. At each step, say $t + 1$, one new node $N$ is added to $G_t$.
The new node is connected to a prototype node $m$, chosen uniformly at random among those
in $\mathcal{N}_t$. Then $d$ out-links are added to node $N$. The $i$th out-link is chosen as follows: with
probability $\alpha$ the destination node is chosen uniformly at random among those in $\mathcal{N}_t$, and
with probability $1 - \alpha$ the destination node is taken to be that of the $i$th out-link of the prototype
node $m$. Note that this is possible since the algorithm generates a constant out-degree graph.
Rather than proposing estimation strategies for the two parameters $(\alpha, d)$ of this particular
duplication-attachment model, the analysis of Kumar et al. [185] aims at deriving
results about topological properties of duplication-attachment graphs, described as functions
of the two parameters $(\alpha, d)$. Recent extensions of this model include a model where
fractions of both out-links and in-links of the prototype node $m$ are copied by the newly added
node $N$ [193]. The goal of the analyses in this line of research, however, remains that of
replicating properties of observed graphs, with a few exceptions. In the biological context,
duplication-attachment models have proved useful in modeling protein-protein
interaction networks. For example, Ratmann et al. [245] proposed a mixture of preferential
attachment and duplication divergence with parent-child attachment to assess the
evolutionary dynamics of protein interaction networks of H. pylori and P. falciparum. They
proposed a likelihood-free MCMC-based routine to estimate the posterior of network summary
statistics. A more general review of work in modeling dynamics (evolution) on the basis of
protein-protein interaction data is available in [246].
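
The basic duplication-attachment step of Kumar et al. described above can be sketched as follows. The seed graph is our own assumption: the description needs every existing node to have exactly $d$ out-links, so we start from a single node with $d$ self-loops. The function name `copying_model` and the parameter values are hypothetical.

```python
import random


def copying_model(n_nodes, d, alpha, seed=0):
    """Return out_links, where out_links[v][i] is the destination of v's i-th out-link."""
    rng = random.Random(seed)
    out_links = [[0] * d]  # assumption: seed graph is node 0 with d self-loops
    for new in range(1, n_nodes):
        prototype = rng.randrange(new)  # uniform over the existing nodes
        links = []
        for i in range(d):
            if rng.random() < alpha:
                links.append(rng.randrange(new))  # uniform destination
            else:
                links.append(out_links[prototype][i])  # copy the prototype's i-th link
        out_links.append(links)
    return out_links


out_links = copying_model(n_nodes=500, d=3, alpha=0.2)
in_degree = [0] * len(out_links)
for links in out_links:
    for dest in links:
        in_degree[dest] += 1
print("largest in-degrees:", sorted(in_degree, reverse=True)[:5])
```

Copying tends to concentrate in-links on nodes that are already frequently linked, which is the mechanism behind the skewed in-degree distributions studied for these graphs.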

Wiuf et al. [326] have developed a recursive construction of the likelihood for duplication- 
attachment models, effectively enabling principled statistical data analysis, estimation and 
inference. 



4.4 Continuous Time Markov Chain Models 

The use of continuous Markov processes to model dynamic networks was first proposed by 
Holland and Leinhardt [148] and Wasserman [312] and most recently studied by Snijders 
and colleagues [275; 276]. As shall become clear in this section, continuous Markov process 
models (CMPM) are intimately tied to the ERGM models described in section 3.6. Within 
the CMPM family, network edges are taken to be binary (either absent or present, but 
not weighted), and the evolution occurs one edge at a time. Model variants arise due to 
the many possible specifications of edge change probability. Some exceptions to this general 
approach include the party model of Mayer [206], where multiple edges are allowed to change 
at the same time, and the work of Koskinen and Snijders [179], which deals with Bayesian 
parameter inference methods for the case where not all edge modifications are observed. 

We begin by providing a quick reminder of continuous Markov processes, borrowing
notation from [275]. Define $\{Y(t) \mid t \in \mathcal{T}\}$ to be a stochastic process, where $Y(t)$ has a finite
outcome space $\mathcal{Y}$ and $\mathcal{T}$ is a continuous time interval. Suppose that a Markov condition
holds: for any possible outcome $\tilde{y} \in \mathcal{Y}$ and any pair of time points $\{t_a < t_b \mid t_a, t_b \in \mathcal{T}\}$,

$$ \Pr\{Y(t_b) = \tilde{y} \mid Y(t) = y(t), \forall t : t \le t_a\} = \Pr\{Y(t_b) = \tilde{y} \mid Y(t_a) = y(t_a)\}. \tag{4.1} $$

In other words, supposing that $t_b$ denotes the future and $t_a$ the present, then conditioning
on the past is equivalent to conditioning on the present when it comes to determining the
future. If the probability in Equation 4.1 depends only on $t_b - t_a$, then one can prove that
$Y(t)$ has a stationary transition distribution, and the transition matrix

$$ \Pr(t_b - t_a) := \big[ \Pr\{Y(t_b) = \tilde{y} \mid Y(t_a) = y\} \big]_{y, \tilde{y} \in \mathcal{Y}} \tag{4.2} $$

can be written as a matrix exponential

$$ \Pr(t) = e^{tQ}, \tag{4.3} $$

where $Q$ is known as the intensity matrix with elements $q(y, \tilde{y})$. The elements $q(y, \tilde{y})$ can be
thought of as the slope (rate of change) of the probability of state change as a function of
time, i.e., $\Pr\{Y(t + \epsilon) = \tilde{y} \mid Y(t) = y\} \approx \epsilon\, q(y, \tilde{y})$. The diagonal elements $q(y, y)$ are negative
and are defined so that the rows of $Q$ sum to zero.
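
To make the relation between the intensity matrix $Q$ and the transition matrix concrete, the sketch below approximates $e^{tQ}$ for a small chain using a truncated Taylor series combined with repeated squaring. The 3-state matrix `Q` is a made-up example, and the helper names `mat_mul` and `transition_matrix` are ours; in practice one would normally rely on a numerical library routine.

```python
def mat_mul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def transition_matrix(q, t, squarings=10, terms=12):
    """Approximate exp(t * Q) by a truncated Taylor series plus repeated squaring."""
    n = len(q)
    scale = t / 2 ** squarings
    small = [[q[i][j] * scale for j in range(n)] for i in range(n)]
    identity = [[float(i == j) for j in range(n)] for i in range(n)]
    result = [row[:] for row in identity]
    term = [row[:] for row in identity]
    for order in range(1, terms + 1):
        # term now holds (scale * Q)**order / order!
        term = [[x / order for x in row] for row in mat_mul(term, small)]
        result = [[r + x for r, x in zip(r_row, t_row)]
                  for r_row, t_row in zip(result, term)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


# Hypothetical intensity matrix: nonnegative off-diagonal rates, rows sum to zero.
Q = [[-0.7, 0.5, 0.2],
     [0.1, -0.4, 0.3],
     [0.6, 0.6, -1.2]]
P1 = transition_matrix(Q, 1.0)
for row in P1:
    print([round(x, 4) for x in row], "row sum:", round(sum(row), 6))
```

Each row of the result is a probability distribution over the states at time $t$, and for a very small $t$ the off-diagonal entries are approximately $t$ times the corresponding rates.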

When modeling a social network, the outcome space $\mathcal{Y}$ is taken to be all possible edge
configurations of an $N$-node network, and an individual configuration $y \in \mathcal{Y}$ is taken to be
a binary vector of length $\binom{N}{2}$. We use the shorthand $q_{ij}(y)$ to denote the propensity for
the edge between node $i$ and $j$ to flip into its opposite value under configuration $y$. The
function $q_{ij}(y)$ completely specifies the dynamics of the network model. We now review
several variants of CMPM which differ only in their definition of $q_{ij}(y)$.

Independent arc, reciprocity, and popularity models. The independent arc model 
employs the simplest definition of $q_{ij}(y)$:

$$ \text{Independent arc model:} \quad q_{ij}(y) = \lambda_{y_{ij}}, \tag{4.4} $$

i.e., $Y_{ij}$ changes from 0 to 1 at rate $\lambda_0$, and from 1 to 0 at rate $\lambda_1$. In this model,
modification to one edge does not depend on the setting of other edges. The model is simple 
enough that the transition probabilities Pr(t) can be derived in closed form (see, e.g., Taylor 
and Carlin [292] p. 362-364). Maximum likelihood parameter estimation for this model was 
discussed in [278]. 
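
For a single arc, the independent arc model is a two-state chain, and the closed form is the standard two-state result: with rate $\lambda_0$ from 0 to 1 and $\lambda_1$ from 1 to 0, the probability of moving from 0 to 1 within time $t$ is $\frac{\lambda_0}{\lambda_0 + \lambda_1}\left(1 - e^{-(\lambda_0 + \lambda_1)t}\right)$, and symmetrically for 1 to 0. The sketch below simply evaluates it; the function name `arc_transition_probs` and the rate values are our own choices.

```python
import math


def arc_transition_probs(lam0, lam1, t):
    """2x2 transition matrix of one arc; rows index the current state (0 or 1)."""
    total = lam0 + lam1
    decay = math.exp(-total * t)
    p01 = lam0 / total * (1.0 - decay)  # arc absent now, present after time t
    p10 = lam1 / total * (1.0 - decay)  # arc present now, absent after time t
    return [[1.0 - p01, p01], [p10, 1.0 - p10]]


lam0, lam1 = 0.3, 0.9
for t in (0.1, 1.0, 10.0):
    probs = arc_transition_probs(lam0, lam1, t)
    print(f"t = {t:5.1f}: Pr(0 -> 1) = {probs[0][1]:.4f}, Pr(1 -> 0) = {probs[1][0]:.4f}")
```

As $t$ grows, the probability that the arc is present approaches $\lambda_0 / (\lambda_0 + \lambda_1)$ regardless of its initial state.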

In the reciprocity model, the rate of change in $y_{ij}$ depends only on the reciprocal edge $y_{ji}$:

$$ \text{Reciprocity model:} \quad q_{ij}(y) = \lambda_{y_{ij}} + \mu_{y_{ij}} y_{ji}. \tag{4.5} $$

Thus, if no link currently exists between nodes $i$ and $j$, then the propensity for adding
either directed edge is $\lambda_0$; if one directed edge exists, then the reciprocal edge is added with
propensity $\lambda_0 + \mu_0$. If one directed edge exists, then it is deleted with rate $\lambda_1$. If both edges
exist, then the deletion propensity for either is $\lambda_1 + \mu_1$. The transition matrix $\Pr(t)$ can be
derived but has a complicated form [189; 272].

Along the same line of development, the popularity model and the expansiveness model
[312; 313] define the change rate for edge $y_{ij}$ to be dependent on $y_{+j}$, the in-degree of node
$j$, or $y_{i+}$, the out-degree of node $i$:

$$ \text{Popularity model:} \quad q_{ij}(y) = \lambda_{y_{ij}} + \pi_{y_{ij}} y_{+j}, \tag{4.6} $$
$$ \text{Expansiveness model:} \quad q_{ij}(y) = \lambda_{y_{ij}} + \pi_{y_{ij}} y_{i+}. \tag{4.7} $$
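
The four rate functions share one pattern: a baseline rate indexed by the current value of $y_{ij}$, plus a coefficient (also indexed by $y_{ij}$) times a single network covariate. The sketch below encodes that reading, assuming the rates take the form $\lambda_{y_{ij}} + c_{y_{ij}} \times \text{covariate}$ with the covariate being $y_{ji}$, $y_{+j}$ or $y_{i+}$. The function name `flip_rate`, the argument names and the numerical values are hypothetical.

```python
def flip_rate(y, i, j, model, lam, coef=(0.0, 0.0)):
    """Rate at which arc i -> j flips, given the adjacency matrix y (list of lists).

    lam and coef are pairs indexed by the current value of y[i][j].
    """
    state = y[i][j]
    if model == "independent":
        covariate = 0
    elif model == "reciprocity":
        covariate = y[j][i]  # is the reverse arc present?
    elif model == "popularity":
        covariate = sum(row[j] for row in y)  # in-degree of node j
    elif model == "expansiveness":
        covariate = sum(y[i])  # out-degree of node i
    else:
        raise ValueError(f"unknown model: {model}")
    return lam[state] + coef[state] * covariate


y = [[0, 1, 0],
     [1, 0, 0],
     [0, 1, 0]]
lam = (0.5, 0.2)  # made-up values: creation rate, deletion rate
print(flip_rate(y, 0, 1, "independent", lam))
print(flip_rate(y, 1, 2, "reciprocity", lam, coef=(0.3, 0.1)))
print(flip_rate(y, 0, 1, "popularity", lam, coef=(0.05, 0.02)))
```

In the reciprocity call, arc $1 \to 2$ is absent while $2 \to 1$ is present, so the creation rate is the baseline plus the reciprocity bonus.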

Edge-oriented dynamics. Snijders [276] outlines two categories of transition dynamics: 
edge-oriented and node-oriented. In both cases, the intensity matrix is factored into two 
components: one controls the opportunity for change, and the other specifies the propensity 
of change. More precisely, the continuous time Markov process is now split into two sub- 
processes; the first operating in the continuous time domain and dictating when a change 
should occur; the second dealing with the probability of the discrete event of individual 
edge flips. Both edge-oriented and node-oriented dynamics can be interpreted as stochastic
optimizations of a potential function $f(y)$ on the network configuration. The difference is
that, in the edge-oriented case, $f$ is based on global statistics of the network, whereas in the
node-oriented case, $f$ is defined for each node's local neighborhood. Moreover, the choice of
which edge to flip differs between the two formulations.

Using $y(i, j, z)$ to denote the configuration where the edge $y_{ij}$ has the value $z \in \{0, 1\}$,
edge-oriented dynamics can be written in the following general form:

$$ q_{ij}(y) = \rho\, p_{ij}(y), \tag{4.8} $$

where

$$ p_{ij}(y) = \frac{\exp(f(y(i, j, 1 - y_{ij})))}{\exp(f(y(i, j, 0))) + \exp(f(y(i, j, 1)))}. \tag{4.9} $$

Thus, in edge-oriented dynamics each edge follows an independent Poisson process, so that
the time until the next event has an exponential distribution with parameter $\rho$. When an
event occurs for edge $i \to j$, the edge flips to its opposite value with probability $p_{ij}(y)$.

The potential function $f(y)$ is usually defined as a linear combination of network
statistics:

$$ f(y) = \sum_k \beta_k s_k(y). \tag{4.10} $$

This should start to look familiar. Indeed the CMPM process with edge-oriented dynamics
is equivalent to the Gibbs sampling process for ERGMs (where the next edge to be updated
is selected randomly). The statistics $s_k(y)$ take on the usual forms (see Table 4.1).
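
A small simulation may help to see how the pieces fit together. The sketch below assumes the flip probability has the logistic form $\exp(f(\text{flipped})) / [\exp(f(y(i,j,0))) + \exp(f(y(i,j,1)))]$ and uses a potential with only two statistics, the number of arcs and the number of mutual pairs (each counted once, which is our choice). Superposing the per-edge Poisson processes of rate $\rho$ gives exponential waiting times with rate $\rho$ times the number of ordered pairs. All names (`potential`, `flip_probability`, `simulate`) and parameter values are hypothetical.

```python
import math
import random


def potential(y, beta):
    """f(y) = beta[0] * (number of arcs) + beta[1] * (number of mutual pairs)."""
    n = len(y)
    arcs = sum(map(sum, y))
    mutual = sum(y[i][j] * y[j][i] for i in range(n) for j in range(i + 1, n))
    return beta[0] * arcs + beta[1] * mutual


def flip_probability(y, i, j, beta):
    """Probability that arc i -> j flips, given an opportunity for change."""
    old = y[i][j]
    f = [0.0, 0.0]
    for z in (0, 1):
        y[i][j] = z
        f[z] = potential(y, beta)
    y[i][j] = old  # restore the configuration
    shift = max(f)  # subtract the maximum for numerical stability
    w = [math.exp(f[0] - shift), math.exp(f[1] - shift)]
    return w[1 - old] / (w[0] + w[1])


def simulate(n, beta, rho, t_end, seed=0):
    rng = random.Random(seed)
    y = [[0] * n for _ in range(n)]
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    t = 0.0
    while True:
        t += rng.expovariate(rho * len(pairs))  # superposed Poisson processes
        if t > t_end:
            return y
        i, j = rng.choice(pairs)  # each ordered pair is equally likely to fire
        if rng.random() < flip_probability(y, i, j, beta):
            y[i][j] = 1 - y[i][j]


y_final = simulate(n=5, beta=(-1.0, 2.0), rho=1.0, t_end=3.0)
for row in y_final:
    print(row)
```

With a negative arc coefficient and a positive mutuality coefficient, arcs are discouraged unless they reciprocate an existing arc.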

Number of directed arcs:         $s_1(y) = \sum_{ij} y_{ij}$
Number of reciprocated arcs:     $s_2(y) = \sum_{ij} y_{ij} y_{ji}$
Number of paths of length two:   $\sum_{ijk} y_{ij} y_{jk}$
Number of transitive triplets:   $\sum_{ijk} y_{ij} y_{ik} y_{jk}$

Table 4.1: The table of network statistics for a directed social network.

The statistics in Table 4.1 assume directed graphs; however, it is easy to derive
the corresponding statistics for undirected graphs. For example, in the undirected case all
the edges are "reciprocal" and thus $s_1$ and $s_2$ are combined into $s'(y) = \sum_{i} \sum_{j > i} y_{ij}$.

Due to their close relation to ERGMs, edge-oriented models suffer the same fate of
degeneracy. For example, if the parameter $\beta$ for transitive triplets is not too small, then
with high probability the simulated network will be a complete graph. However, compared
to static networks, degeneracy in the longitudinal case is not as much a concern, as the
complete graph will only emerge at some distant time in the future.

Node-oriented dynamics. Fully node-oriented dynamics [275] defines the intensity
matrix as

$$ q_{ij}(y) = \rho_i\, p_{ij}(y), \tag{4.11} $$

where

$$ p_{ij}(y) = \frac{\exp(f_i(y(i, j, 1 - y_{ij})))}{\sum_{h \neq i} \exp(f_i(y(i, h, 1 - y_{ih})))}. \tag{4.12} $$

Thus the independent Poisson processes for determining edge change opportunity are now
defined for each node (with intensity $\rho_i$) as opposed to each edge. Given the opportunity for
edge change, each node seeks to optimize its own potential function as defined by

$$ f_i(y) = \sum_k \beta_k s_{ik}(y). \tag{4.13} $$

The function $f_i(y)$ is similar to the global potential $f(y)$ in Equation 4.10 but only aggregates
over the local neighborhood of node $i$. Node $i$ favors changing the incident edge that would
lead to the biggest increase in its potential.



Edge-node mixed dynamics. Snijders [276] also suggested a form of mixed dynamics
where the opportunity for change is edge-oriented, but the potential functions are
node-oriented:

$$ p_{ij}(y) = \frac{\exp(f_i(y(i, j, 1 - y_{ij})))}{\exp(f_i(y(i, j, 0))) + \exp(f_i(y(i, j, 1)))}. \tag{4.14} $$

Thus the opportunity to modify each edge $i \to j$ follows independent Poisson processes with
parameter $\rho$. But given the opportunity for change, the probability of an actual flip depends
on node $i$'s local network configuration.



Remark. Parameter estimation in CMPM models has until recently been done via the method
of moments, where the expected values are obtained through MCMC on simulated networks
[273]. Koskinen and Snijders [179] proposed a Bayesian inference method that allows for 
computation of the posterior distribution of the parameters and treats missing values more 
adequately. For details of the procedure, please refer to Koskinen and Snijders [179]. 



4.5 Discrete Time Markov Models 

In this section, we outline three recent proposals of dynamic network models operating in
the discrete time domain (see also [22]). All three models have the Markov property and
represent the likelihood as a sequence of factored conditional probabilities

$$ \Pr(Y^1, Y^2, \ldots, Y^T) = \Pr(Y^T \mid Y^{T-1}) \Pr(Y^{T-1} \mid Y^{T-2}) \cdots \Pr(Y^2 \mid Y^1) \Pr(Y^1), \tag{4.15} $$

where $\{Y^1, \ldots, Y^T\}$ is a sequence of $T$ observed snapshots of the network. Banks and Carley
[22] discussed the simplest version of such models. See also [253].

4.5.1 Discrete Markov ERGM Model



Hanneke and Xing [139] proposed a natural extension of the ERGM model in the discrete
Markov domain. Unlike the setup in the continuous domain, the potential function in this
model involves the statistics of two consecutive configurations of the network:

$$ \Pr(y^t \mid y^{t-1}) = \frac{1}{Z} \exp\Big\{ \sum_k \beta_k s_k(y^t, y^{t-1}) \Big\}, \tag{4.16} $$

where $Z$ is a normalizing constant. Table 4.2 lists a few examples of network statistics defined on pairs of network snapshots.



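Density of edges:   $s_1(y^t, y^{t-1}) = \frac{1}{N-1} \sum_{ij} y^t_{ij}$
Stability:          $s_2(y^t, y^{t-1}) = \frac{1}{N-1} \sum_{ij} \big[ y^t_{ij} y^{t-1}_{ij} + (1 - y^t_{ij})(1 - y^{t-1}_{ij}) \big]$
Reciprocity:        $s_3(y^t, y^{t-1}) = N \sum_{ij} y^t_{ji} y^{t-1}_{ij} \Big/ \sum_{ij} y^{t-1}_{ij}$
Transitivity:       $s_4(y^t, y^{t-1}) = N \sum_{ijk} y^t_{ik} y^{t-1}_{ij} y^{t-1}_{jk} \Big/ \sum_{ijk} y^{t-1}_{ij} y^{t-1}_{jk}$

Table 4.2: The table of network statistics for pairs of network snapshots.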
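
Statistics on pairs of snapshots are straightforward to compute. The sketch below covers only density and stability, assuming the forms $\frac{1}{N-1}\sum_{ij} y^t_{ij}$ and $\frac{1}{N-1}\sum_{ij}[y^t_{ij} y^{t-1}_{ij} + (1 - y^t_{ij})(1 - y^{t-1}_{ij})]$; the normalization and the exclusion of self-loops ($i = j$) are assumptions on our part. The function names and the two toy snapshots are hypothetical.

```python
def density(y_t, y_prev):
    """Scaled count of arcs in the current snapshot (y_prev is unused)."""
    n = len(y_t)
    return sum(y_t[i][j] for i in range(n) for j in range(n) if i != j) / (n - 1)


def stability(y_t, y_prev):
    """Scaled count of ordered pairs whose arc status did not change."""
    n = len(y_t)
    unchanged = sum(
        y_t[i][j] * y_prev[i][j] + (1 - y_t[i][j]) * (1 - y_prev[i][j])
        for i in range(n) for j in range(n) if i != j
    )
    return unchanged / (n - 1)


def log_unnormalized_prob(y_t, y_prev, beta):
    """The exponent of the transition model, without the normalizing constant."""
    return beta[0] * density(y_t, y_prev) + beta[1] * stability(y_t, y_prev)


y_prev = [[0, 1, 0],
          [0, 0, 1],
          [1, 0, 0]]
y_t = [[0, 1, 1],
       [0, 0, 1],
       [0, 0, 0]]
print("density:", density(y_t, y_prev))
print("stability:", stability(y_t, y_prev))
print("log unnormalized probability:", log_unnormalized_prob(y_t, y_prev, beta=(0.5, 1.0)))
```

In this toy example four of the six ordered pairs keep their status between the two snapshots, so the stability statistic equals $4 / (N - 1) = 2$.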

The basic model may be extended to allow for multiple relations, node attributes, and
$K$-th order Markov dependencies of the form

$$ \Pr(Y^{K+1}, Y^{K+2}, \ldots, Y^T \mid Y^1, \ldots, Y^K) = \prod_{t=K+1}^{T} \Pr(Y^t \mid Y^{t-K}, \ldots, Y^{t-1}), \tag{4.17} $$

where

$$ \Pr(Y^t \mid Y^{t-K}, \ldots, Y^{t-1}) = \frac{1}{Z} \exp\Big\{ \sum_k \beta_k s_k(Y^t, Y^{t-1}, \ldots, Y^{t-K}) \Big\}. \tag{4.18} $$

The joint distribution of the first $K$ network snapshots may be represented by an ERGM
for the first snapshot, and a $(k - 1)$-th order discrete Markov dependency model for $Y^k$. The
paired network statistics may be extended over $K$ network sequences.

Maximum likelihood parameter estimates may be computed via a numerical approximation
technique such as the Newton-Raphson method. Computation of the gradient and
Hessian requires the mean and covariance of the sequence network statistics, which are
exactly computable for a pair of networks, but require Gibbs sampling in the $K$-sequence case
[139]. The likelihood of this model is well behaved if the minimal sufficient statistics involve
only dyads; however, like its static counterpart, the full dynamic ERGM is prone to
likelihood degeneracy.

4.5.2 Dynamic Latent Space Model

Sarkar and Moore [264] extended the static latent space model of Hoff et al. [146] (cf.
section 3.9) in the time domain. Recall that in the static latent space model, the log odds ratio
of a link between nodes $i$ and $j$ depends on the distance between their latent positions $Z_i$
and $Z_j$. The dynamic latent space model allows the latent positions to change over time in
Gaussian-distributed random steps:

$$ Z_t \mid Z_{t-1} \sim N(Z_{t-1}, \sigma^2 I). \tag{4.19} $$

The observation model is a modified version of the original latent space model¹:

$$p^L_{ij} := p(y_{ij} = 1) = \frac{1}{1 + \exp(d_{ij} - r_{ij})}, \qquad (4.20)$$

where $d_{ij}$ is the Euclidean distance between i and j in latent space, and $r_{ij}$ is a radius
of influence defined as $c \times (\max(\delta_i, \delta_j) + 1)$ ($\delta_i$ and $\delta_j$ being the degrees of nodes i and j,
respectively). The "radius of influence" is based on the assumption that the higher the
maximum degree of the two end nodes, the more likely the edge. This may be true in
citation networks where prolific authors are more likely to form new co-authorships. The
constant 1 is added to ensure that the radius is non-zero, and c is estimated from data by a
line search (a minimization method in one dimension).

The link probability $p_{ij}$ is defined to be a mixture between the modified latent space link
probability $p^L_{ij}$ and a noise probability $\rho$. The idea is that pairs of nodes that are outside
of each other's radius have only a low noise probability of establishing a link, while nodes
within each other's radii follow the probability $p^L_{ij}$:

$$p_{ij} = K(d_{ij})\, p^L_{ij} + (1 - K(d_{ij}))\, \rho, \qquad (4.21)$$

where the mixing weight $K(d_{ij})$ is a kernel function of the latent distance.

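The full observation model is then

$$\Pr(Y^t \mid Z_t) = \prod_{i \sim j} p_{ij} \prod_{i \not\sim j} (1 - p_{ij}), \qquad (4.22)$$

where $i \sim j$ denotes the presence of an edge between i and j, and $i \not\sim j$ its absence. The
latent space positions $Z_t$ are estimated in sequence for $t = 1, \ldots, T$ by maximizing the
likelihood of the observed $Y^t$ combined with the transition model (4.19):

$$Z_t = \arg\max_Z \Pr(Y^t \mid Z) \Pr(Z \mid Z_{t-1}). \qquad (4.23)$$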
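The objective in (4.23) can be written down directly once the link probability is fixed. The sketch below is a hedged illustration rather than the authors' implementation: the function names are hypothetical, the kernel is assumed to be biquadratic (positive inside the radius, zero outside), degrees are taken from the current snapshot, and the Gaussian transition term is kept only up to an additive constant.

```python
import math


def radius(deg_i, deg_j, c):
    """Radius of influence c * (max(degree_i, degree_j) + 1)."""
    return c * (max(deg_i, deg_j) + 1)


def kernel(d, r):
    """Assumed biquadratic kernel: 1 at distance 0, 0 at or beyond the radius."""
    return (1.0 - (d / r) ** 2) ** 2 if d <= r else 0.0


def link_prob(z_i, z_j, deg_i, deg_j, c, rho):
    """Mixture of the latent space probability and the noise probability rho."""
    d = math.dist(z_i, z_j)
    r = radius(deg_i, deg_j, c)
    p_latent = 1.0 / (1.0 + math.exp(d - r))
    k = kernel(d, r)
    return k * p_latent + (1.0 - k) * rho


def log_objective(z, z_prev, adj, c, rho, sigma):
    """log Pr(Y_t | Z) + log Pr(Z | Z_{t-1}), dropping constants in Z."""
    n = len(z)
    deg = [sum(row) for row in adj]
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):  # links are undirected: visit each pair once
            p = link_prob(z[i], z[j], deg[i], deg[j], c, rho)
            total += math.log(p) if adj[i][j] else math.log(1.0 - p)
        # Gaussian random-walk penalty for moving node i away from its old position.
        sq_step = sum((a - b) ** 2 for a, b in zip(z[i], z_prev[i]))
        total -= sq_step / (2.0 * sigma ** 2)
    return total
```

Any generic optimizer can then be applied to `log_objective` as a function of `z`; the noise probability `rho` must lie strictly between 0 and 1 so that both logarithms are finite.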

The authors propose conjugate gradient optimization starting from an initial estimate of
the latent positions based on a multidimensional scaling (MDS) transform of the observed
pairwise distances. To eliminate rotational ambiguity, a Procrustean (rotationally invariant)
transform is applied to the MDS transform so that $Z_t$ is aligned with $Z_{t-1}$.

Applying the model to the NIPS paper co-authorship dataset (cf. subsection 2.2.6), the
authors gave anecdotal evidence of the validity of the changing embeddings of several
well-known machine learning researchers over time. The dynamics of the researchers' latent
positions offered insight into the evolution of the machine learning community.

¹ Note that in this dynamic version of the latent space model, links are assumed to be undirected.

Sarkar et al. [265] also proposed a richer model based on [124], which improved upon
previous work in two ways. One differentiating feature of this work was the ability to
simultaneously embed words and authors into the latent space, which allowed for the
representation of a two-mode network. The major advantage, however, was the inference
method: the authors proposed a Kalman-filter-like dynamic procedure, which allowed for
estimation of the posterior distributions over the positions of the authors in the latent space.
The proposed procedure was applied to a simulated NIPS dataset.

The impact of this line of work is twofold: first, it offers an explanation of the
network at every time step, and second, it enables accurate and efficient prediction of the
state of the network at a future time step. The proposed inference procedures made it
possible for network modeling to scale to large dynamic collections of data. The drawback of
this approach is the lack of an explicit mechanism that could explain the dynamics behind
real networks.

Another latent model for citation networks was developed in the physics community.
Leicht et al. [190] proposed using latent variables to capture the grouping of papers that
have similar citation profiles over time. The network in this case is a directed acyclic graph
and the nodes are papers rather than authors. Using as an example a set of opinions from the
US Supreme Court and their citations between 1789 and 2007, the authors showed
how a simple latent model was able to recover, in a completely unsupervised manner, the
different eras in US Supreme Court opinion references. The parameters of the model, except
for the number of latent classes, were estimated using an EM algorithm. Different numbers
of latent classes were tested and each revealed something new about the underlying data.
The authors also compared the latent method to a clustering based on network modularity
[233]. Even with the information about time (directionality in the graph) removed, the latent
variable model was still able to discover the same split between two groups of opinions that
happened around 1937. The network modularity clustering in a way validated the outcome
of the latent model.

In a separate experiment, Leicht et al. [190] showed that deterministic approaches such
as "hubs and authorities" and eigenvector centrality [171] discovered interesting network
properties that were not revealed by the statistical models. The deterministic analyses
showed several significant drops in the age of the authorities cited, meaning that once in a
while a younger set of opinions became the new authorities, and that this happened in
a "decisive" manner rather than gradually. In this way, deterministic network analysis
approaches complement statistical models.

4.5.3 Dynamic Contextual Friendship Model (DCFM) 

The dynamic contextual friendship model (DCFM) of Goldenberg and Zheng [128] represents
an attempt to capture several aspects of the complexity of the evolution of real social
networks over time. In a real-life friendship network, people may meet and interact with
each other under different contexts (e.g., school, work projects, social outings), and the
strength of interpersonal relationships changes over time based on these interactions. DCFM
offers such a mechanism for network evolution, where edges have weights that indicate the
strength of the relationship, and each node is given a distribution over social interaction
spheres (contexts). A context is defined to be any activity where people may interact with
each other. At each time step, each node chooses a random context according to the
node's distribution over contexts. Nodes that appear in the same context update the weights
of the links between them. The probability of a weight increase (or decrease) depends on
whether the pair had a chance to meet (a coin toss in the model) and the "friendliness"
parameters of the individuals involved. The possibility of both positive and negative weight
updates allows for edge birth and death over time. An extension of the model also allows
for addition and deletion of nodes.

The underlying dynamics are captured by a first-order Markov chain model. Letting $W^t$
denote the weighted adjacency matrix at time t, the basic generative process at time t can
be formalized as follows:

1. For each node i, sample context $C_i \sim \mathrm{mult}(\theta_i)$, where $\theta_i$ denotes the context
distribution parameters.

2. For each pair of nodes i and j in the same context, sample meeting variable
$M_{ij} \sim \mathrm{Bern}(\nu_i \nu_j)$, where $\nu_i$ and $\nu_j$ represent the "friendliness" of nodes i and j.

3. For each pair of nodes i and j, sample the new link weight

$$W^t_{ij} \sim \begin{cases} \mathrm{Poi}(\lambda_h (W^{t-1}_{ij} + 1)) & \text{if } M_{ij} = 1, \\ \mathrm{Poi}(\lambda_l W^{t-1}_{ij}) & \text{otherwise,} \end{cases}$$

where $\lambda_h$ and $\lambda_l$ are hyperparameters indicating the rates of growth and decay,
respectively. The idea is that a meeting should increase the edge weight with high probability,
otherwise the weight decays.

The parameters $\theta_i$, $\nu_i$, $\lambda_h$, and $\lambda_l$ all have conjugate priors and are estimated through
Gibbs sampling [331].

The model can generate networks with a number of different properties. For example,
Figure 4.3 shows various degree distributions generated by DCFM, while Figure 4.4
demonstrates possible relation dynamics. Pair (47, 45) shows a brief resumption of the
relationship, which dissolves again shortly afterwards. While DCFM is capable of emulating
such long-term memory of past relationships, it does so at the cost of added model complexity.

Figure 4.3: Log-log plot of the degree distributions of a network with 200 people. $\nu_i$ is drawn
from Beta(1, 3) for the plot on the left, and from Beta(1, 8) for the plot on the right. Solid
lines represent a linear fit and dashed lines a quadratic fit to the data. Contexts are drawn
every 50th time step.

Figure 4.4: Weight dynamics for 4 different pairs, (11,33), (52,49), (47,45), and (52,53), in a
DCFM simulated network of 600 people over 600 time steps. Context switches occur every
50th time step and 6 = 3.

Few datasets contain weighted relationships. The Enron dataset (cf. subsection 2.2.2)
contains email exchanges that can be aggregated on a weekly basis to simulate the strength of
relationships. In the NIPS dataset (cf. subsection 2.2.6), the number of joint publications
per year can represent the strength of the coauthorship. In these cases, the DCFM contexts
can be taken to be the topics of emails or articles, and the friendliness parameters can be
estimated using the method of moments.

One drawback of DCFM is its lack of identifiability; it is impossible to tell without
additional knowledge whether an individual formed many friendships because they frequently
change contexts and are very friendly or because the contexts themselves tend to be large.
Also, weighted network data are hard to come by and thus pseudo-weights often have to be
used.

The DCFM model is important in its own right: the life-mimicking, rich generative 
mechanism is a step towards realistic complex models that ultimately can be used to explain 
the intricacies of observed data, especially if additional information about contexts and 
individuals' friendliness is available. 
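The generative process of this section can be simulated in a few lines. The sketch below is a minimal illustration under stated assumptions, not the implementation of [128]: function names are hypothetical, pairs in different contexts are treated as not having met, contexts are redrawn at every step, weights are symmetric, and the Poisson sampler uses Knuth's multiplication method, which is only suitable for small rates.

```python
import math
import random


def sample_poisson(rate, rng):
    """Knuth's multiplication method; adequate for the small rates used here."""
    if rate <= 0.0:
        return 0
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def dcfm_step(weights, theta, nu, lam_h, lam_l, rng):
    """One time step: draw contexts, decide meetings, resample link weights."""
    n = len(weights)
    # Each node i draws a context from its own distribution theta[i].
    contexts = [
        rng.choices(range(len(theta[i])), weights=theta[i])[0] for i in range(n)
    ]
    new_weights = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            same_context = contexts[i] == contexts[j]
            # Meeting is a coin toss with probability nu_i * nu_j (assumed 0 otherwise).
            met = same_context and rng.random() < nu[i] * nu[j]
            if met:
                rate = lam_h * (weights[i][j] + 1)  # growth after a meeting
            else:
                rate = lam_l * weights[i][j]        # decay without a meeting
            new_weights[i][j] = new_weights[j][i] = sample_poisson(rate, rng)
    return new_weights


def simulate(n, theta, nu, lam_h, lam_l, steps, seed=0):
    """Return the list of weight matrices W^0, ..., W^steps, starting from zeros."""
    rng = random.Random(seed)
    weights = [[0] * n for _ in range(n)]
    history = [weights]
    for _ in range(steps):
        weights = dcfm_step(weights, theta, nu, lam_h, lam_l, rng)
        history.append(weights)
    return history
```

Because a weight of zero with no meeting yields a Poisson rate of zero, links that have died stay dead until the pair meets again, which is how the sketch reproduces edge birth and death.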
Chapter 5

Issues in Network Modeling 



There are a number of major statistical modeling and inferential challenges in the analysis of 
network data that go well beyond those described in previous sections of this article. These 
relate to both the quality and the ease of statistical inference and we mention a few of them 
here: 

Network Visualization. With the rise of online social networks and network modeling,
we have seen a proliferation of visualization tools, especially those based on variations of
constraint-based spring model algorithms; see, e.g., the discussion and references in
Shneiderman and Aris [267]. The automated algorithms often use node degrees or some form
of distance metric between nodes to arrange their placement. For example, SoNIA¹ is a
popular package for visualizing dynamic or longitudinal network data; it can be used as a
platform for the development, testing, and comparison of various static and dynamic layout
techniques. However, little is known about how to effectively combine visualization with the
kinds of statistical models we review here, especially if one wants to use the visualization as
another tool in the analysis of network data.

Computability. Can we do statistical estimation computations and model fitting exactly 
for large networks, e.g., by full MCMC methods for mixed membership and exponential 
random graph models, or do we need to resort to approximations such as those involved in 
the variational approximation employed by [8; 9]? 

For ERGMs, a newly updated suite of programs and documentation is now available
[138; 157; 224; 129]. The SIENA package² developed by Snijders and colleagues contains
a complementary suite of programs that are particularly useful for longitudinal network
analyses (though Rinaldo et al. [251] speak words of caution). The packages are capable of
handling networks of up to a few thousand nodes.

It is unrealistic to expect that models for really large networks with millions of
nodes can be estimated using exact methods. Even variational approximations, which have
their own drawbacks such as sensitivity to the starting point, are not feasible for networks
on a really large scale. The key to network modeling and parameter estimation is to take
into account the sparsity that comes with size. Methods that work well on small or
medium-sized but relatively dense networks might be computationally infeasible or rest on
invalid assumptions for larger networks. As we gear up to model very large networks, it is
important to focus not only on the disadvantages that size brings but also on its advantages.

¹ http://sonia.stanford.edu/

² http://stat.gamma.rug.nl/siena.html

Asymptotics and Assessing Goodness of Fit. There are no standard large-sample
asymptotics for networks (e.g., as N goes to infinity) that can be used to assess the
goodness-of-fit of models. Thus we may have serious problems with variance estimates for
parameters and with confidence or posterior interval estimates. While a few models with a
small number of fixed parameters have well-behaved asymptotics, the problems here tend to be
the inherent dependence of network data and the growth in the number of parameters to be
estimated as N increases. Haberman [134] comments briefly on asymptotics in his discussion of
the $p_1$ model, and notes the similarity to issues for the Rasch model from item response theory.
The lack of asymptotics means that we may have problems of consistency of estimators, but
it also means that there is no standard basis for model comparison and assessing goodness
of fit. Most other authors have addressed these issues either empirically, e.g., Hunter et al.
[156], or not at all.

There are two alternative approaches. We can consider assessing fit or comparing models 
using exact distributions given the minimal sufficient statistics (MSSs). This works for 
simple models but not obviously for the general class of ERGMs or most dynamic models in 
the literature. Further, for many of the models, especially those involving latent variables, 
the MSSs are the data themselves. Alternatively, we could think in terms of some form of 
cross-validation for model selection and assessment. The problem with cross-validation is 
the boundary effects associated with subsets of nodes. This is directly related to the problem 
of sampling in networks. 

Bickel and Chen [37] address the problem of asymptotics in the context of blockmodeling 
or community discovery, and the methods they exploit may be useful in a broader context 
when the number of parameters to be estimated grows as N increases. 

Sampling. Do our data represent the entire network or are they based on only a
subnetwork or subgraph? When the data come from a subgraph, even one selected at random, we
need to worry about the effects at the boundary³ and the attendant biases they bring to
parameter estimates; cf. the negative result of Stumpf et al. [289] for scale-free models, in
which they show the extent and nature of the bias. Most of the early results on sampling for
network data focused on random subgraphs and exploited the traditional statistical theory of
design-based sampling, in which the properties of the network are assumed to be fixed, and
we evaluate sample quantities by considering their distribution under all possible similarly
selected subgraphs. For details, see the many papers by Ove Frank [109; 295] and others
[125; 135; 258]. Wiuf and Stumpf [325] and Stumpf and Thorne [288] recently adopted
a related but different approach focusing on properties such as degree distributions using
binomial random sample sizes from "large" graphs. Others such as Leskovec and Faloutsos
[191] examine aspects of the question in an empirical but ad hoc fashion. The relevance
of sampling for model-based network inference was first addressed by Thompson and Frank
[295], and further developed by Handcock and Gile [135], who adapt MCMC algorithms for
exponential random graph models to account for sampling designs. To date, these are the
only works to seriously explore this important topic. Airoldi and Carley [6] quantify the
sensitivity of alternative sampling algorithms to generate graphs that share similar
topological properties, as well as the divergence of topological properties of algorithms for
sampling popular network models.

³ The boundary is the collection of observed nodes which have links to the unobserved nodes. The
boundary can potentially include all observed nodes. Only nodes for which the set of known links is certain
to be complete are not included in the boundary, a condition that is hard to satisfy in real world networks.

We expect the issue of sampling to be of relevance to virtually all of the models and
we need to explore its consequences. This will be especially true when we try to update
model parameter estimates based on extracts of data in a dynamic fashion.
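The boundary effect discussed above is easy to see on a toy example. The sketch below uses hypothetical helper names and represents an undirected network as a list of node pairs; it extracts the subgraph induced by a sample of nodes, lists the boundary nodes (sampled nodes with at least one link to an unsampled node), and shows how observed degrees are truncated. In practice the full edge list is of course unavailable, which is precisely the difficulty.

```python
def induced_subgraph(edges, sampled):
    """Edges whose two endpoints were both sampled."""
    sampled = set(sampled)
    return [(u, v) for u, v in edges if u in sampled and v in sampled]


def boundary_nodes(edges, sampled):
    """Sampled nodes with at least one link to a node outside the sample."""
    sampled = set(sampled)
    boundary = set()
    for u, v in edges:
        if u in sampled and v not in sampled:
            boundary.add(u)
        if v in sampled and u not in sampled:
            boundary.add(v)
    return boundary


def degree(edges, node):
    """Number of edges incident to node."""
    return sum(1 for u, v in edges if node in (u, v))


# A star with centre 0 and four leaves; only nodes 0, 1 and 2 are observed.
full_edges = [(0, 1), (0, 2), (0, 3), (0, 4)]
sample = [0, 1, 2]
observed_edges = induced_subgraph(full_edges, sample)
```

Here the centre of the star lies on the boundary, and its observed degree is half of its true degree, so any estimate built on observed degrees inherits that bias.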

Missing data. Along with sampling comes the question of how to treat missing data in
the statistical analysis of networks. Usually, non-respondents to surveys are excluded from
the analysis and the modeling considers only individuals for whom all data are available. A
few works deal with missing data directly. The empirical impact of survey non-respondents on
the analysis is considered in [284]; the modeling implications and inference for non-respondents
in ERGMs can be found in [255; 120; 178]. Missing data in longitudinal studies is the subject
of [154]. This work makes assumptions about sampling strategies to justify the estimation
of missing edges using a Missing at Random assumption. Because this assumption is not
correct in general, we have an interesting set of open problems. Kossinets [180] considers
three missing data mechanisms: network boundary specification (non-inclusion of actors or
affiliations), survey non-response, and censoring by vertex degree (fixed choice design), and
examines their effect on a study of a scientific collaboration network. One type of missing
data, links or relations, can be treated as a prediction task by treating links between
nodes in a given network as probabilistic quantities and using statistical models based on
the available data to estimate the likelihood of those edges being present. The problem of
prediction is often addressed in the machine learning community and we discuss it next.

Prediction. In our review of the literature on networks across many disciplines we have
found limited methodological work focusing on evaluating and comparing the predictive
ability of various models, static or dynamic. There are papers on link prediction in the
relational network model literature (e.g., [238]). Liben-Nowell and Kleinberg [198] develop
approaches to link prediction based on measures for analyzing the "proximity" of nodes
in a network, e.g., the WWW. In the biological literature, a number of papers examine the
problem of predicting missing links in biological networks ([327] is one of the earlier
works). However, these papers focus on how to cleverly combine heterogeneous data in
order to discover new links. The evaluation is usually limited to cross-validation on the
known links, information that is incomplete and available only for a few organisms. In
the sociological literature on organizations, there is often interest in distinguishing among
organizations on the basis of their network structure, so there would clearly be interest in
utilizing methodology for prediction based on network structure. Because making predictions
of various sorts from dynamic network models fits well within the machine learning paradigm,
we expect to see many more papers on the topic in the not too distant future.
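As a concrete instance of a proximity-based predictor, the sketch below ranks currently unlinked pairs of nodes by their number of common neighbours, one of the simplest proximity measures of this kind. The function name and the edge-list representation are assumptions of this example, and the score is not tied to any of the statistical models reviewed here.

```python
from collections import defaultdict
from itertools import combinations


def common_neighbour_scores(edges):
    """Rank unlinked node pairs by their number of shared neighbours.

    `edges` is a list of undirected pairs; ties are broken by node labels.
    """
    neighbours = defaultdict(set)
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)

    scores = {}
    for u, v in combinations(sorted(neighbours), 2):
        if v not in neighbours[u]:  # only score pairs without an observed link
            scores[(u, v)] = len(neighbours[u] & neighbours[v])

    # Highest score first: these pairs are the most likely candidate links.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

Evaluating such a ranking requires held-out links, which ties prediction back to the sampling and missing data issues raised above.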

Embeddability. Underlying most dynamic network models is a continuous time stochastic
process even though the data used to study the models and their implications may come in
the form of repeated snapshots at discrete time points (epochs), a form of time sampling as
opposed to the node sampling referred to above, or as cumulative network links. In such
circumstances we need to take special care in how we represent and estimate the
continuous-time parameters in the actual data realizations used to fit models. This is known in
the statistical literature as the embeddability problem and was studied for Markov processes
in the 1970s by Singer and Spilerman [270; 271] for social processes, and more recently by
Hansen and Scheinkman [140] in the context of econometric models and by others in the
computational finance literature. Wasserman [313] and various papers by Snijders and his
collaborators illustrate how to address embedding in some simple dynamic models.
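The simplest case in which embeddability can be checked by hand is a two-state Markov chain, for instance a single dyad switching between "no link" and "link". A one-step transition matrix [[1-a, a], [b, 1-b]] is the exponential of a valid intensity matrix exactly when a + b < 1. The sketch below, with hypothetical function names, recovers that intensity matrix and verifies it with a truncated series for the matrix exponential; it is a toy illustration rather than a method from the works cited above.

```python
import math


def two_state_generator(a, b):
    """Intensity matrix Q with exp(Q) = [[1-a, a], [b, 1-b]], or None.

    Returns None when the discrete-time matrix is not embeddable.
    """
    total = a + b
    if a < 0.0 or b < 0.0 or total >= 1.0:
        return None
    if total == 0.0:
        return [[0.0, 0.0], [0.0, 0.0]]  # the identity matrix: nothing ever changes
    rate = -math.log(1.0 - total)  # overall switching intensity
    alpha = a * rate / total       # intensity of moving from state 0 to state 1
    beta = b * rate / total        # intensity of moving from state 1 to state 0
    return [[-alpha, alpha], [beta, -beta]]


def expm_2x2(q, terms=40):
    """Matrix exponential of a 2 x 2 matrix by a truncated power series."""
    result = [[1.0, 0.0], [0.0, 1.0]]
    power = [[1.0, 0.0], [0.0, 1.0]]
    for k in range(1, terms):
        # power becomes q^k / k! by multiplying the previous term by q / k.
        power = [
            [sum(power[i][m] * q[m][j] for m in range(2)) / k for j in range(2)]
            for i in range(2)
        ]
        result = [[result[i][j] + power[i][j] for j in range(2)] for i in range(2)]
    return result
```

A snapshot matrix with a + b >= 1, such as a dyad that flips state more often than not between epochs, cannot arise from any continuous-time two-state chain observed at that spacing.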

Identifiability. Identifiability of model parameters is a technical issue in statistics that
refers to the fact that multiple solutions may exist (in the parameter space) that lead to
exactly the same likelihood. In this case, no inference procedure can distinguish between
these solutions. For instance, in a mixture model we can permute the assignments of points
to mixture components to obtain an equivalent solution. There are a number of papers
that describe the issue in various models (e.g., [283; 132]) and from different perspectives
(e.g., [51; 52] from the algebraic perspective). A few solutions to address this issue have
been proposed recently. Some consider inference on equivalence classes in a blockmodel for
network data [236]. Others pre-process the data to identify a reference solution that drives
the inference [137].
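The permutation argument can be checked numerically on a small stochastic blockmodel. In the sketch below, which uses hypothetical function names and a directed Bernoulli blockmodel as an assumed example, relabeling the blocks and permuting the block probability matrix accordingly leaves the log-likelihood unchanged.

```python
import math
from itertools import permutations


def sbm_log_lik(adj, labels, block_probs):
    """Log-likelihood of a directed Bernoulli blockmodel without self-loops."""
    n = len(adj)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                p = block_probs[labels[i]][labels[j]]
                total += math.log(p) if adj[i][j] else math.log(1.0 - p)
    return total


def relabel(labels, block_probs, perm):
    """Rename block g to perm[g] in both the labels and the parameters."""
    k = len(perm)
    new_labels = [perm[g] for g in labels]
    new_probs = [[0.0] * k for _ in range(k)]
    for a in range(k):
        for b in range(k):
            new_probs[perm[a]][perm[b]] = block_probs[a][b]
    return new_labels, new_probs


def equivalent_solutions(adj, labels, block_probs):
    """Log-likelihood of every relabeled version of one fitted solution."""
    k = len(block_probs)
    values = []
    for perm in permutations(range(k)):
        new_labels, new_probs = relabel(labels, block_probs, perm)
        values.append(sbm_log_lik(adj, new_labels, new_probs))
    return values
```

With k blocks there are k! such equivalent solutions, which is why inference is sometimes carried out on equivalence classes or anchored to a reference solution.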

Combining links with their attributes. In many network data sets, especially those 
arising in machine learning contexts, there are attributes associated with the network links. 
For example in e-mail and blog databases, the attributes may be taken to be the contents 
of the messages or postings. There is an emerging literature focused on cascades of such 
links but few papers are situated in a full network model setting and few authors attempt 
to combine the models for links with models for message or posting texts. This is a natural 
extension to models described here, especially the mixed membership stochastic blockmodels 
of section 3.8, since the text could naturally be modeled by mixed-membership topic models. 
McCallum et al. [208] and Chang and Blei [61] suggest different ways to approach this kind 
of combination model. Dynamic models that combine evolving block and topic structures 
would be of special interest for such applications. 






Chapter 6 
Summary 



The ubiquity of networks in areas as diverse as the social sciences, biology, computer science,
physics, and economics has spawned an extensive literature on the subject. In this review,
we discussed in detail a few main trends in the statistical network modeling literature, 
focussing on models that have historically inspired many others as well as a few recent 
proposals. By charting the evolution of statistical network modeling approaches, we pointed 
out explicit connections between the discussed models. Figure 6.1 provides a visual diagram 
of model influence; an arrow pointing from A to B means either that the development of 
model A influenced the subsequent development of model B, or that B can be viewed as a 
generalization of A. 

The literature on network modeling may be divided along different lines of motivation.
Models primarily introduced in the physics literature are motivated by asymptotic properties
of networks, whereas the literature stemming from statistics and statistical social science is
additionally concerned with the inference step. Thus, the main criticism of the random graph
models primarily developed in statistical physics is the lack of assessment of the fit of
the models to the data. The main drawback in the statistical literature is the lack of a
comprehensive asymptotic analysis. Though the degeneracy found in the limiting case of the
earlier versions of the ERGM has been addressed, a broader analysis is still missing.

In this work we made a distinction between static and dynamic models. Descriptive
models such as $p_1$, $p_2$, and ERGM are clearly static as they infer a set of sufficient statistics
from a single snapshot of an existing network. The families of continuous and discrete time
Markov models, on the other hand, are clearly dynamic as they seek to model multiple
snapshots of an evolving network. The Erdős-Rényi-Gilbert, preferential attachment, and
small-world models, while ultimately aiming to model a single time point snapshot of a
network, are usually described via generative processes, where edges are added one at a time.
These models can thus be considered as either static, with respect to what they model, or
dynamic, with respect to how they are represented. In this work we refer to them as
pseudo-dynamic.

Within the category of static models we discussed two main directions: models that take
networks as given (see section 3.4, section 3.5, and section 3.6) and models that assume
and estimate latent structures (section 3.8 and section 3.9). Latent structure models have
to make certain assumptions about the data. Stochastic blockmodels assume structural
equivalence of the nodes, whereas latent space models assume the existence of an embedding
of the network in a low dimensional space. These models allow for better understanding of
the data in cases where it is believed to contain hidden structure.

Figure 6.1: Network summarizing the relations between models discussed in our review.
White nodes denote static models, yellow nodes "pseudo-dynamic" models, and green nodes
dynamic models. Arrows indicate inspiration or influence of the model at the source on the
model at the target. The nodes of the figure are:

Erdős-Rényi-Gilbert random graph models (Gilbert 1959, Erdős-Rényi 1959)
Preferential attachment model (Barabasi and Albert 1999)
Duplication attachment model (Kumar et al. 2000, Wiuf et al. 2006)
Small-World studies (Milgram 1967)
Small-World model (Watts and Strogatz 1998)
$p_1$ models (Holland and Leinhardt 1981)
$p_2$ random effects model (van Duijn, Snijders, Zijlstra 2004, 2006)
p* models / ERGM (Frank and Strauss 1986)
Latent space models (Hoff, Raftery, Handcock 2002, Handcock et al. 2007)
Mixed membership blockmodel (Airoldi et al. 2008)
Exchangeable graph model (Airoldi 2009)
Continuous Time Markov Models (Holland and Leinhardt 1977, Wasserman 1977, Snijders 2005, 2006)
Discrete Markov ERGM (Hanneke and Xing 2006)
Dynamic latent space model (Sarkar and Moore 2005)
Dynamic contextual friendship model (Zheng and Goldenberg 2006)

We divided the category of dynamic models into continuous time Markov models and
discrete time Markov models. CMPM (section 4.4) assumes that the adjacency matrix evolves
according to a continuous-time Markov chain whose intensity matrix can depend on various edge
and node dynamics. Discrete time Markov network models deal with a set of network
snapshots observed at various time points. Examples of discrete time Markov network models
include dynamic extensions of ERGM (subsection 4.5.1) and the latent space model
(subsection 4.5.2), the duplication-attachment model, as well as a generative dynamic model for
friendship networks (subsection 4.5.3).

Despite the many advances in network modeling over the last decade, there remains a
host of unresolved issues. We listed some of the issues in chapter 5. We feel that, from a
statistics or machine learning perspective, the biggest breakthroughs are to be made in the
areas of inference and dynamic modeling. Creating a new model, or perhaps fixing an existing
one, so that it provides realistic generative and inference mechanisms that can
identifiably infer the parameters of a large real-world network would be a great contribution
to the statistical network modeling community.